What is the difference between groupByKey and reduceByKey in Spark?


Both reduceByKey and groupByKey are wide transformations, which means both trigger a shuffle operation.

The key difference between reduceByKey and groupByKey is that reduceByKey performs a map-side combine and groupByKey does not.
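The two semantics can be contrasted with a plain-Python sketch (not actual Spark API — just the aggregation logic, using the same words as the example that follows):

```python
from collections import defaultdict

words = ["RED", "GREEN", "RED", "RED"]
pairs = [(w, 1) for w in words]

# groupByKey semantics: gather every value for a key, then aggregate.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)            # all values kept: RED -> [1, 1, 1]
group_by_key_counts = {k: sum(vs) for k, vs in groups.items()}

# reduceByKey semantics: fold each value into a running total as it arrives,
# so only one value per key is ever held.
reduce_by_key_counts = defaultdict(int)
for key, value in pairs:
    reduce_by_key_counts[key] += value

print(group_by_key_counts)               # {'RED': 3, 'GREEN': 1}
print(dict(reduce_by_key_counts))        # {'RED': 3, 'GREEN': 1}
```

Both produce the same counts; the difference, as we will see below, is how much data moves across the network to get there.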

Let’s say we are computing word count on a file with the line below

RED GREEN RED RED

At runtime let’s say we end up with 2 partitions.

Partition 1
RED
GREEN
Partition 2
RED
RED


groupByKey()

With groupByKey() on the above word count problem, since we have two partitions we will end up with 2 tasks. The output of the tasks will look like this –

Task 1
RED, 1
GREEN, 1
Task 2
RED, 1
RED, 1

All 4 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.

Task performing reduce

RED, 1
GREEN, 1
RED, 1
RED, 1

Yielding the result

GREEN 1
RED 3

There are two problems with this –

Since the data is not combined or reduced on the map side, we transferred all elements over the network during shuffle.

Since all elements are sent to the task performing the aggregate operation, that task has to handle many more elements, which could result in an OutOfMemoryError.
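The groupByKey path above can be simulated in plain Python (a sketch of the shuffle, not Spark itself) to count how many records cross the network:

```python
from collections import defaultdict

# The two input partitions from the example above.
partitions = [["RED", "GREEN"], ["RED", "RED"]]

# Map phase with no combine: each task emits one (word, 1) pair per word.
map_outputs = [[(w, 1) for w in part] for part in partitions]

# Every pair crosses the network during the shuffle.
shuffled = [pair for task in map_outputs for pair in task]

# The reduce task collects all values per key, then sums them.
groups = defaultdict(list)
for key, value in shuffled:
    groups[key].append(value)
counts = {k: sum(vs) for k, vs in groups.items()}

print(len(shuffled), counts)   # 4 {'RED': 3, 'GREEN': 1}
```

All 4 pairs are shuffled, and the reduce side must buffer the full list of values for each key before it can aggregate.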

reduceByKey()

reduceByKey is optimized with a map-side combine.

Just like groupByKey(), on the same word count problem, since we have two partitions we will end up with 2 tasks. However, with a map-side combine, the output of the tasks will look like this –

Task 1
RED, 1
GREEN, 1
Task 2
RED, 2

With reduceByKey, only 3 elements from Tasks 1 and 2 will be sent over the network to the task performing the reduce operation.

Task performing reduce

RED, 1
GREEN, 1
RED, 2

Yielding the result

GREEN 1
RED 3

reduceByKey() is optimized due to the map-side combine, thereby sending fewer elements over the network.

Also, the task performing the reduce operation after the shuffle has fewer elements to reduce.
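The same plain-Python shuffle sketch, with a map-side combine added, shows the saving (3 records shuffled instead of 4):

```python
from collections import Counter

# The two input partitions from the example above.
partitions = [["RED", "GREEN"], ["RED", "RED"]]

# Map phase WITH a map-side combine: each task sums its counts locally
# before anything is shuffled.
map_outputs = [list(Counter(part).items()) for part in partitions]
# Task 1 -> [('RED', 1), ('GREEN', 1)], Task 2 -> [('RED', 2)]

# Only the pre-combined partial counts cross the network.
shuffled = [pair for task in map_outputs for pair in task]

# The reduce task merely merges the partial counts.
counts = Counter()
for key, value in shuffled:
    counts[key] += value

print(len(shuffled), dict(counts))   # 3 {'RED': 3, 'GREEN': 1}
```

On this tiny example the saving is one record; on real data with many repeated keys per partition, the combine can shrink the shuffle dramatically.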

Should I use groupByKey or reduceByKey?

Use reduceByKey, as it performs a map-side combine, which reduces the amount of data sent over the network during the shuffle and thereby also the amount of data the reduce task has to process.

Is there a good reason to use groupByKey?

Use groupByKey() when a map-side combine is not applicable or would produce incorrect results — for example, when you need the full list of values for each key rather than an aggregate.
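A per-key median is one such case (this is an illustrative sketch with made-up records, not Spark code): a median cannot be built from partial reductions, so every value for a key must be grouped together first.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (key, value) records for illustration.
records = [("a", 1), ("a", 9), ("a", 5), ("b", 2), ("b", 4)]

# groupByKey-style grouping: keep every value for each key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# The median needs the whole list at once; a running combine cannot do this.
medians = {k: median(vs) for k, vs in groups.items()}
print(medians)   # {'a': 5, 'b': 3.0}
```

Here a map-side combine would have collapsed the values prematurely and made the median impossible to compute.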

