What is the difference between repartition and coalesce in Spark? - Big Data In Real World

What is the difference between repartition and coalesce in Spark?


Both repartition and coalesce change the number of partitions of a DataFrame in Spark. They differ in how they move data to get there.

Repartition

  • Can increase or decrease the number of partitions.
  • Repartition always involves a full shuffle.
  • Repartition works by creating new partitions and shuffling all of the data across them.
  • Results in partitions of roughly equal size.
  • Since a full shuffle takes place, repartition is generally more expensive than coalesce when the goal is only to reduce the number of partitions.
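
The behavior described above can be sketched with a small toy model in plain Python. This is not Spark code and not how Spark implements its shuffle; partitions are modeled as plain lists, and the function name `toy_repartition` and the sample data are made up for illustration. It only shows why reassigning every row produces evenly sized partitions, whatever the input skew. In PySpark itself the corresponding call is `df.repartition(n)`.

```python
from typing import List


def toy_repartition(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy model of a full shuffle (hypothetical helper, not a Spark API).

    Every row is taken out of its current partition and reassigned
    round-robin, so all rows may move and the output sizes are near-equal.
    """
    new_partitions: List[list] = [[] for _ in range(num_partitions)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        new_partitions[i % num_partitions].append(row)
    return new_partitions


# Five skewed partitions holding 12 rows in total (made-up sample data).
skewed = [list(range(0, 7)), [7], [8, 9], [], [10, 11]]
balanced = toy_repartition(skewed, 3)

print([len(p) for p in skewed])    # [7, 1, 2, 0, 2]
print([len(p) for p in balanced])  # [4, 4, 4]
```

The same function can also increase the partition count, for example `toy_repartition(skewed, 8)`, because it never depends on the existing partition boundaries.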


Coalesce

  • Can only decrease the number of partitions.
  • Coalesce doesn’t involve a full shuffle.
  • For example, when the number of partitions is reduced from 5 to 2, coalesce leaves the data in 2 of the partitions where it is and moves the data from the remaining 3 partitions into those 2, thereby avoiding a full shuffle.
  • Because existing partitions are merged rather than rebalanced, the resulting partition sizes can vary widely.
  • Since a full shuffle is avoided, coalesce is generally cheaper than repartition.
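
The merging behavior described above can be sketched with a small toy model in plain Python. This is not Spark code: Spark chooses which partitions to merge using its own logic, which also takes data locality into account, whereas this sketch simply groups neighboring partitions. The name `toy_coalesce` and the sample data are made up for illustration. In PySpark itself the corresponding call is `df.coalesce(n)`.

```python
from typing import List


def toy_coalesce(partitions: List[list], num_partitions: int) -> List[list]:
    """Toy model of coalesce without a shuffle (hypothetical helper).

    Whole partitions are merged into fewer groups; individual rows are
    never redistributed, so any skew in the input is carried over.
    """
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count cannot grow.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Assumption of this sketch: neighboring partitions are grouped.
        target = i * num_partitions // len(partitions)
        merged[target].extend(part)
    return merged


# Five skewed partitions holding 12 rows in total (made-up sample data).
skewed = [list(range(0, 7)), [7], [8, 9], [], [10, 11]]
merged = toy_coalesce(skewed, 2)

print([len(p) for p in merged])          # [10, 2]
print(len(toy_coalesce(skewed, 8)))      # 5, the count stays unchanged
```

Going from 5 partitions to 2 leaves one partition with 10 rows and the other with 2, which is the uneven sizing noted in the list above.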

Finally, when you call repartition() on an RDD, Spark internally calls coalesce() with the shuffle parameter set to true.
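
That delegation can be illustrated with a short toy model in plain Python. It is a sketch of the relationship only, not Spark's implementation: partitions are plain lists, the function names mirror the Spark methods purely for readability, and the round-robin redistribution is a stand-in for a real shuffle.

```python
from typing import List


def coalesce(partitions: List[list], n: int, shuffle: bool = False) -> List[list]:
    """Toy stand-in for coalesce; the shuffle flag switches the strategy."""
    if shuffle:
        # Full redistribution: every row may move, and n may be larger
        # than the current number of partitions.
        rows = [row for part in partitions for row in part]
        return [rows[i::n] for i in range(n)]
    if n >= len(partitions):
        # No shuffle: the partition count cannot grow.
        return [list(p) for p in partitions]
    merged: List[list] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition(partitions: List[list], n: int) -> List[list]:
    """Toy stand-in for repartition: coalesce with shuffle enabled."""
    return coalesce(partitions, n, shuffle=True)


parts = [[1, 2, 3, 4, 5], [6], []]  # made-up sample data

print([len(p) for p in repartition(parts, 4)])  # [2, 2, 1, 1]
print(len(coalesce(parts, 4)))                  # 3, cannot grow without shuffle
```

In this model the only thing separating the two operations is the shuffle flag, which is also why repartition can raise the partition count while a plain coalesce cannot.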

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
