How to fix the Hive metastore database is not initialized issue?
December 18, 2020How to convert List to a JavaRDD in Spark?
December 23, 2020The function of repartition and coalesce functions in Spark is to change the number of partitions on a DataFrame.
Repartition
- Increase or decrease partitions.
- Repartition always involves a shuffle.
- Repartition works by creating new partitions and doing a full shuffle to move data around.
- Results in more or less equal sized partitions.
- Since a full shuffle takes place, repartition is less performant than coalesce.
Do you like us to send you a 47 page Definitive guide on Spark join algorithms? ===>
Coalesce
- Only decrease the number of partitions.
- Coalesce doesn’t involve a full shuffle.
- If the number of partitions is reduced from 5 to 2. Coalesce will not move data in 2 executors and move the data from the remaining 3 executors to the 2 executors. Thereby avoiding a full shuffle.
- Because of the above reason the partition size vary by a high degree.
- Since full shuffle is avoided, coalesce is more performant than repartition.
Finally, When you call the repartition() function, Spark internally calls the coalesce function with shuffle parameter set to true.