In this post we will see how Spark decides the number of tasks in a job and how many of those tasks execute in parallel.
Let’s see how Spark decides on the number of tasks with the below set of instructions.
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
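As a point of reference, here is a minimal DataFrame sketch of that pipeline. The paths, column names and join key below are hypothetical placeholders, not part of the original example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("task-count-example").getOrCreate()

// Instructions 1-3: read dataset_X, filter it, apply a map-style transformation
val datasetX = spark.read.parquet("/data/dataset_X")      // hypothetical path
  .filter(col("status") === "active")                     // hypothetical column
  .withColumn("amount_usd", col("amount") * 1.1)          // hypothetical map operation

// Instructions 4-5: read dataset_Y, apply a map-style transformation
val datasetY = spark.read.parquet("/data/dataset_Y")      // hypothetical path
  .withColumn("category", upper(col("category")))         // hypothetical map operation

// Instructions 6-8: join, filter the joined result, save the output
datasetX.join(datasetY, Seq("id"))                        // hypothetical join key
  .filter(col("amount_usd") > 100)
  .write.mode("overwrite").parquet("/data/output")        // hypothetical output path
```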
Stages and number of tasks per stage
Spark will create 3 stages –
First stage – Instructions 1, 2 and 3
Second stage – Instructions 4 and 5
Third stage – Instructions 6, 7 and 8
Number of tasks in first stage
The first stage reads dataset_X, which has 10 partitions, so stage 1 will result in 10 tasks.
If your dataset is very small, you might still see Spark create 2 tasks. This is because Spark looks at the defaultMinPartitions property, which sets the minimum number of partitions (and therefore tasks) Spark will create when reading the data. The default value of defaultMinPartitions is 2.
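If you want to confirm these numbers yourself, a quick way (assuming the datasetX variable from the sketch above) is:

```scala
// Number of partitions (and therefore tasks) for the read stage
println(datasetX.rdd.getNumPartitions)

// Minimum number of partitions Spark falls back to (2 by default)
println(spark.sparkContext.defaultMinPartitions)
```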
Number of tasks in second stage
The second stage reads dataset_Y, which has 5 partitions, so stage 2 will result in 5 tasks.
Number of tasks in third stage
The third stage executes a JOIN. A join is a wide transformation, and a wide transformation results in a shuffle. The Spark optimizer tries to pick the “right” number of partitions during a shuffle, but most often you will see Spark create 200 tasks for stages that execute wide transformations such as JOIN and GROUP BY.
The spark.sql.shuffle.partitions property controls the number of partitions used during a shuffle, and its default value is 200.
Change the value of spark.sql.shuffle.partitions to change the number of partitions used during a shuffle:
sqlContext.setConf("spark.sql.shuffle.partitions", "8")
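If you are using a SparkSession (Spark 2.x and later), the same setting can also be applied through its configuration API:

```scala
// Equivalent to the sqlContext call above, using the SparkSession runtime config
spark.conf.set("spark.sql.shuffle.partitions", "8")
```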
Number of tasks execution in parallel
The total number of CPU cores available across an application’s executors determines the number of tasks that can be executed in parallel at any given time.
Let’s say you have 5 executors available for your application, and each executor is assigned 10 CPU cores.
5 executors and 10 CPU cores per executor = 50 CPU cores available in total.
With the above setup, Spark can execute a maximum of 50 tasks in parallel at any given time.
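As a rough sketch, such a layout can be requested when the application is configured. spark.executor.instances and spark.executor.cores are standard Spark properties, though how they are honored depends on your cluster manager:

```scala
import org.apache.spark.sql.SparkSession

// Ask for 5 executors with 10 cores each = 50 task slots in total.
// Actual allocation depends on what the cluster manager can grant.
val spark = SparkSession.builder()
  .appName("parallelism-example")
  .config("spark.executor.instances", "5")  // honored on YARN/Kubernetes
  .config("spark.executor.cores", "10")
  .getOrCreate()
```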