What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism in Spark? - Big Data In Real World


Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks executed at runtime, thereby controlling data distribution and parallelism. This means both properties have a direct effect on performance.

spark.sql.shuffle.partitions

spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations in the DataFrame and SQL APIs. The default value for this property is 200.

spark.default.parallelism

spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. 

spark.default.parallelism applies only to the RDD API and is ignored by DataFrame operations, which use spark.sql.shuffle.partitions instead.

The default value depends on your deployment –

  • Local mode: number of cores on the local machine
  • Mesos fine-grained mode: 8
  • Others: total number of cores on all executor nodes or 2, whichever is larger
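When you do not want to rely on these deployment-dependent defaults, both properties can be set explicitly at submit time. A sketch (the application file name and value of 300 are placeholders):

```shell
spark-submit \
  --conf spark.sql.shuffle.partitions=300 \
  --conf spark.default.parallelism=300 \
  my_app.py
```

A common starting point is 2–3 tasks per available executor core, then tuning from there based on shuffle sizes.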