In this post we will see how Spark decides the number of tasks in a job and how many of those tasks execute in parallel.
Let’s see how Spark decides on the number of tasks with the below set of instructions.
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
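As a point of reference, here is a minimal DataFrame sketch of that pipeline. The paths, column names and join key below are hypothetical placeholders, not part of the original example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("task-count-example").getOrCreate()

// Instructions 1-3: read dataset_X, filter it, apply a map-style transformation
val datasetX = spark.read.parquet("/data/dataset_X")      // hypothetical path
  .filter(col("status") === "active")                     // hypothetical column
  .withColumn("amount_usd", col("amount") * 1.1)          // hypothetical map operation

// Instructions 4-5: read dataset_Y, apply a map-style transformation
val datasetY = spark.read.parquet("/data/dataset_Y")      // hypothetical path
  .withColumn("category", upper(col("category")))         // hypothetical map operation

// Instructions 6-8: join, filter the joined result, save the output
datasetX.join(datasetY, Seq("id"))                        // hypothetical join key
  .filter(col("amount_usd") > 100)
  .write.mode("overwrite").parquet("/data/output")        // hypothetical output path
```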
Stages and number of tasks per stage
Spark will create 3 stages –
First stage – Instructions 1, 2 and 3
Second stage – Instructions 4 and 5
Third stage – Instructions 6, 7 and 8
Number of tasks in first stage
The first stage reads dataset_X, which has 10 partitions, so stage 1 will result in 10 tasks.
If your dataset is very small, you might still see Spark create 2 tasks. This is because Spark looks at the defaultMinPartitions property, which sets the minimum number of partitions (and therefore tasks) Spark will create when reading the data. The default value of defaultMinPartitions is 2.
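If you want to confirm these numbers yourself, a quick way (assuming the datasetX variable from the sketch above) is:

```scala
// Number of partitions (and therefore tasks) for the read stage
println(datasetX.rdd.getNumPartitions)

// Minimum number of partitions Spark falls back to (2 by default)
println(spark.sparkContext.defaultMinPartitions)
```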
Number of tasks in second stage
The second stage reads dataset_Y, which has 5 partitions, so stage 2 will result in 5 tasks.
Number of tasks in third stage
The third stage executes a JOIN. A join is a wide transformation, and a wide transformation results in a shuffle. The Spark optimizer tries to pick the “right” number of partitions during a shuffle, but most often you will see Spark create 200 tasks for stages that execute wide transformations such as JOIN and GROUP BY.
The spark.sql.shuffle.partitions property controls the number of partitions used during a shuffle, and its default value is 200.
Change the value of spark.sql.shuffle.partitions to change the number of partitions used during a shuffle:
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
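If you are using a SparkSession (Spark 2.x and later), the same setting can also be applied through its configuration API:

```scala
// Equivalent to the sqlContext call above, using the SparkSession runtime config
spark.conf.set("spark.sql.shuffle.partitions", "8")
```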
Number of tasks execution in parallel
The total number of CPU cores available across an application’s executors determines the number of tasks that can be executed in parallel at any given time.
Let’s say you have 5 executors available for your application, and each executor is assigned 10 CPU cores.
5 executors and 10 CPU cores per executor = 50 CPU cores available in total.
With the above setup, Spark can execute a maximum of 50 tasks in parallel at any given time.
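As a rough sketch, such a layout can be requested when the application is configured. spark.executor.instances and spark.executor.cores are standard Spark properties, though how they are honored depends on your cluster manager:

```scala
import org.apache.spark.sql.SparkSession

// Ask for 5 executors with 10 cores each = 50 task slots in total.
// Actual allocation depends on what the cluster manager can grant.
val spark = SparkSession.builder()
  .appName("parallelism-example")
  .config("spark.executor.instances", "5")  // honored on YARN/Kubernetes
  .config("spark.executor.cores", "10")
  .getOrCreate()
```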