Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that are executed at runtime, thereby controlling data distribution and parallelism. As a result, both properties have a direct effect on performance.
spark.sql.shuffle.partitions
spark.sql.shuffle.partitions configures the number of partitions that are used when shuffling data for joins or aggregations. The default for this property is 200.
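As a minimal sketch (assuming a local SparkSession named spark and an application name chosen here for illustration), the property can be lowered before a shuffle-producing aggregation runs, and the resulting partition count can be checked on the output:

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-partitions-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Lower the shuffle partition count from the default of 200.
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

    // groupBy triggers a shuffle, so the aggregated result is written
    // into spark.sql.shuffle.partitions partitions (adaptive query
    // execution, when enabled, may coalesce these further).
    val agg = df.groupBy("key").sum("value")
    println(agg.rdd.getNumPartitions)

    spark.stop()
  }
}
```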
spark.default.parallelism
spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user.
spark.default.parallelism works only with RDDs and is ignored when working with DataFrames.
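A corresponding sketch of the RDD side, again using a local master for illustration: spark.default.parallelism is picked up by parallelize and by wide transformations such as reduceByKey when no partition count is passed explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DefaultParallelismDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("default-parallelism-demo")
      .setMaster("local[*]")
      .set("spark.default.parallelism", "4")
    val sc = new SparkContext(conf)

    // parallelize uses spark.default.parallelism when numSlices is not given.
    val rdd = sc.parallelize(1 to 100)
    println(rdd.getNumPartitions) // 4

    // reduceByKey also falls back to spark.default.parallelism
    // when no explicit number of partitions is supplied.
    val reduced = rdd.map(x => (x % 10, x)).reduceByKey(_ + _)
    println(reduced.getNumPartitions) // 4

    sc.stop()
  }
}
```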
The default value depends on your deployment:
- Local mode: number of cores on the local machine
- Mesos fine-grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger
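The values your deployment actually resolved to can be read back at runtime, for example from spark-shell (assuming a SparkSession named spark is in scope):

```scala
// Effective RDD-side parallelism resolved for this deployment.
println(spark.sparkContext.defaultParallelism)

// Current DataFrame shuffle partition setting (defaults to 200).
println(spark.conf.get("spark.sql.shuffle.partitions"))
```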