What is the difference between repartition and coalesce in Spark?

Published by Big Data In Real World at December 21, 2020

Repartition

Can increase or decrease the number of partitions.
Repartition always involves a shuffle.
Repartition works by creating new partitions and doing a full shuffle to move data around.
Results in partitions of roughly equal size.
Because a full shuffle takes place, repartition is generally more expensive than coalesce when the goal is only to reduce the number of partitions.
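The points above can be illustrated with a small toy model. This is plain Python, not Spark code: the function name `repartition` and the round-robin assignment are assumptions made for illustration, and Spark's real shuffle is more involved. The sketch only shows why touching every record tends to produce evenly sized partitions.

```python
from itertools import chain

def repartition(partitions, num_partitions):
    """Toy model: send every record to a new partition round-robin.

    Every record is visited and reassigned, which mimics a full shuffle.
    """
    new_partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(chain.from_iterable(partitions)):
        new_partitions[i % num_partitions].append(record)
    return new_partitions

# Five skewed input partitions holding 15 records in total.
skewed = [[1, 2, 3, 4, 5, 6, 7], [8], [9, 10], [11, 12, 13, 14], [15]]

fewer = repartition(skewed, 3)   # decrease the partition count
more = repartition(skewed, 6)    # increase the partition count

print([len(p) for p in fewer])   # prints [5, 5, 5]
print([len(p) for p in more])    # prints [3, 3, 3, 2, 2, 2]
```

In both directions the resulting sizes differ by at most one record, at the cost of moving all of the data.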

Coalesce

Can only decrease the number of partitions. If a larger number is requested, the partition count stays as it is.
Coalesce doesn’t involve a full shuffle.
For example, if the number of partitions is reduced from 5 to 2, coalesce leaves the data in 2 of the partitions where it is and moves the data from the remaining 3 partitions into those 2, thereby avoiding a full shuffle.
As a result, partition sizes can vary considerably.
Because a full shuffle is avoided, coalesce is generally cheaper than repartition.
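The 5-to-2 example above can be sketched as a toy model. This is plain Python, not Spark code: the function name `coalesce`, the choice to keep the first partitions in place, and the `moved` counter are assumptions made for illustration. Spark's actual grouping of partitions also takes data locality into account.

```python
def coalesce(partitions, num_partitions):
    """Toy model: keep the first num_partitions partitions in place and
    append each remaining partition, as a whole, to one of them.

    Returns the merged partitions and the number of records that moved.
    """
    if num_partitions >= len(partitions):
        # Without a shuffle, the partition count cannot grow.
        return [list(p) for p in partitions], 0
    merged = [list(p) for p in partitions[:num_partitions]]
    moved = 0
    for i, part in enumerate(partitions[num_partitions:]):
        merged[i % num_partitions].extend(part)
        moved += len(part)
    return merged, moved

# Five skewed input partitions holding 15 records in total.
skewed = [[1, 2, 3, 4, 5, 6, 7], [8], [9, 10], [11, 12, 13, 14], [15]]

merged, moved = coalesce(skewed, 2)
sizes = [len(p) for p in merged]
print(sizes)   # prints [10, 5]
print(moved)   # prints 7
```

Only 7 of the 15 records move, but the two resulting partitions end up with 10 and 5 records, which is the size imbalance described above.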

Finally, when you call repartition() on an RDD, Spark internally calls coalesce() with the shuffle parameter set to true.

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
