Let’s see this with an example. Here is the series of instructions in our Spark code; let’s see how Spark decides on stages and tasks for this set of instructions (a code sketch of the same sequence follows the list).
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
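To make the walkthrough concrete, here is a minimal sketch of how these eight instructions might look as Spark DataFrame code. The paths and column names (dataset_x, dataset_y, key, amount) are hypothetical and used only for illustration; the two datasets are assumed to share a key column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("stages-and-tasks").getOrCreate()

// 1. READ dataset_X
val datasetX = spark.read.parquet("/data/dataset_x")
// 2. FILTER on dataset_X (narrow)
val filteredX = datasetX.filter(col("amount") > 0)
// 3. MAP operation on dataset_X (narrow)
val mappedX = filteredX.withColumn("amount_usd", col("amount") * 0.85)

// 4. READ dataset_Y
val datasetY = spark.read.parquet("/data/dataset_y")
// 5. MAP operation on dataset_Y (narrow)
val mappedY = datasetY.withColumn("key", upper(col("key")))

// 6. JOIN dataset_X and dataset_Y (wide -> shuffle)
val joined = mappedX.join(mappedY, Seq("key"))
// 7. FILTER on joined dataset (narrow)
val result = joined.filter(col("amount_usd") > 100)
// 8. SAVE the output (action -> triggers the whole job)
result.write.mode("overwrite").parquet("/data/output")
```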
Stages
Spark will create a stage for each dataset that is read
All consecutive narrow transformations (e.g. FILTER, MAP) will be grouped together inside the stage
Spark will create a new stage whenever it encounters a wide transformation (e.g. JOIN, reduceByKey), because a wide transformation requires a shuffle
For the above set of instructions, Spark will create 3 stages –
First stage – Instructions 1, 2 and 3
Second stage – Instructions 4 and 5
Third stage – Instructions 6, 7 and 8
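One way to check this breakdown yourself (a small sketch, reusing the result DataFrame from the code above) is to print the physical plan before running the job: Exchange nodes in the plan are shuffles, which is exactly where Spark cuts a new stage. After the job runs, the Spark UI’s Stages tab shows the same breakdown.

```scala
// Each Exchange node in the physical plan is a shuffle, i.e. a stage
// boundary. The exact plan depends on the Spark version and on whether
// the optimizer picks a shuffle join or a broadcast join.
result.explain()
```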
Tasks
Spark creates a task to execute a set of instructions inside a stage.
The number of tasks in a stage equals the number of partitions of the dataset it processes.
A task executes all consecutive narrow transformations inside a stage – this is called pipelining.
Tasks in the first stage will execute instructions 1, 2 and 3
Tasks in the second stage will execute instructions 4 and 5
Tasks in the third stage will execute instructions 6, 7 and 8
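As a rough sketch of how the task counts line up with partitions (reusing mappedX, mappedY and spark from the code above; the exact numbers depend on your input file splits and configuration):

```scala
// Tasks in the first stage = partitions of dataset_X after reading
println(s"dataset_X partitions: ${mappedX.rdd.getNumPartitions}")

// Tasks in the second stage = partitions of dataset_Y after reading
println(s"dataset_Y partitions: ${mappedY.rdd.getNumPartitions}")

// Tasks in the third (post-shuffle) stage are governed by
// spark.sql.shuffle.partitions (200 by default), unless adaptive
// query execution coalesces them at runtime.
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
println(s"shuffle partitions: $shufflePartitions")
```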