Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation.
The key difference between reduceByKey and groupByKey is that reduceByKey performs a map-side combine, while groupByKey does not.
Let's say we are computing a word count on a file containing the line below:
RED GREEN RED RED
At runtime let’s say we end up with 2 partitions.
Partition 1: RED GREEN
Partition 2: RED RED
groupByKey()
With groupByKey() on the above word count problem, since we have two partitions we will end up with 2 tasks. The output of the tasks will look like below –
Task 1: (RED, 1), (GREEN, 1)
Task 2: (RED, 1), (RED, 1)
All 4 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.
Task performing reduce
(RED, 1), (GREEN, 1), (RED, 1), (RED, 1)
Yielding the result
GREEN: 1
RED: 3
There are two problems with this –
Since the data is not combined or reduced on the map side, we transferred all elements over the network during shuffle.
Since all elements are sent to the task performing the aggregate operation, that task must handle more elements, which could possibly result in an OutOfMemory error.
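The groupByKey() flow above can be sketched in plain Python. This is a simulation of the shuffle behavior, not actual Spark code; the partition lists mirror the example above.

```python
from collections import defaultdict

# The two partitions from the example above.
partitions = [["RED", "GREEN"], ["RED", "RED"]]

# Map side: each task emits one (word, 1) pair per element -- no combining.
map_outputs = [[(word, 1) for word in part] for part in partitions]

# Every pair crosses the network: 2 + 2 = 4 shuffled elements.
shuffled = [pair for task in map_outputs for pair in task]
assert len(shuffled) == 4

# Reduce side: collect all values per key, then sum them.
groups = defaultdict(list)
for word, count in shuffled:
    groups[word].append(count)
result = {word: sum(counts) for word, counts in groups.items()}
print(result)  # {'RED': 3, 'GREEN': 1}
```

Note that the reduce side holds every individual value per key in memory before summing, which is what makes groupByKey vulnerable to OutOfMemory errors on skewed keys.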
reduceByKey()
reduceByKey is optimized with a map side combine.
Just like groupByKey(), on the same word count problem, since we have two partitions we will end up with 2 tasks. However with a map side combine, the output of the tasks will look like below –
Task 1: (RED, 1), (GREEN, 1)
Task 2: (RED, 2)
With reduceByKey, only 3 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.
Task performing reduce
(RED, 1), (GREEN, 1), (RED, 2)
Yielding the result
GREEN: 1
RED: 3
reduceByKey() is optimized due to the map-side combine, thereby sending fewer elements over the network. The task performing the reduce operation after the shuffle also has fewer elements to process.
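The reduceByKey() flow can be sketched in plain Python as well. Again this is a simulation, not actual Spark code; `map_side_combine` is a hypothetical helper standing in for the local combine each map task performs.

```python
from collections import defaultdict

# The same two partitions as in the example above.
partitions = [["RED", "GREEN"], ["RED", "RED"]]

def map_side_combine(part):
    # Combine counts per key within the partition before shuffling.
    local = defaultdict(int)
    for word in part:
        local[word] += 1
    return list(local.items())

combined = [map_side_combine(p) for p in partitions]
# Task 1 emits [("RED", 1), ("GREEN", 1)]; Task 2 emits [("RED", 2)].

# Only the pre-combined pairs cross the network: 3 elements, not 4.
shuffled = [pair for task in combined for pair in task]
assert len(shuffled) == 3

# Reduce side: merge the partial counts per key.
result = defaultdict(int)
for word, count in shuffled:
    result[word] += count
print(dict(result))  # {'RED': 3, 'GREEN': 1}
```

The final answer is identical to the groupByKey version; the saving is entirely in how much data crosses the network and how many elements the reduce task must hold.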
Should I use groupByKey or reduceByKey?
Use reduceByKey, as it performs a map-side combine, which reduces both the amount of data sent over the network during the shuffle and the amount of data the reduce task must process.
Is there a good reason to use groupByKey?
Use groupByKey() when a map-side combine is unnecessary or would adversely affect the output, for example when you need the full collection of values for each key rather than a single combined value.
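As an illustration of a case where grouping is the goal itself, consider collecting every value per key. This is a plain-Python sketch, and the event data is made up: there is no single combined value to produce, so a map-side combine has nothing to contribute.

```python
from collections import defaultdict

# Hypothetical (user, page) click events; we want all pages per user,
# not an aggregate -- the groupByKey use case.
events = [("alice", "home"), ("bob", "cart"), ("alice", "checkout")]

pages_by_user = defaultdict(list)
for user, page in events:
    pages_by_user[user].append(page)

print(dict(pages_by_user))
# {'alice': ['home', 'checkout'], 'bob': ['cart']}
```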