Let’s see this with an example. Here is the series of instructions in our Spark code; let’s see how Spark decides on stages and tasks for this set of instructions (a code sketch of the same sequence follows the list).
1. READ dataset_X
2. FILTER on dataset_X
3. MAP operation on dataset_X
4. READ dataset_Y
5. MAP operation on dataset_Y
6. JOIN dataset_X and dataset_Y
7. FILTER on joined dataset
8. SAVE the output
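To make the walkthrough concrete, here is a minimal sketch of how these eight instructions might look as Spark DataFrame code. The paths and column names (dataset_x, dataset_y, key, amount) are hypothetical and used only for illustration; the two datasets are assumed to share a key column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("stages-and-tasks").getOrCreate()

// 1. READ dataset_X
val datasetX = spark.read.parquet("/data/dataset_x")
// 2. FILTER on dataset_X (narrow)
val filteredX = datasetX.filter(col("amount") > 0)
// 3. MAP operation on dataset_X (narrow)
val mappedX = filteredX.withColumn("amount_usd", col("amount") * 0.85)

// 4. READ dataset_Y
val datasetY = spark.read.parquet("/data/dataset_y")
// 5. MAP operation on dataset_Y (narrow)
val mappedY = datasetY.withColumn("key", upper(col("key")))

// 6. JOIN dataset_X and dataset_Y (wide -> shuffle)
val joined = mappedX.join(mappedY, Seq("key"))
// 7. FILTER on joined dataset (narrow)
val result = joined.filter(col("amount_usd") > 100)
// 8. SAVE the output (action -> triggers the whole job)
result.write.mode("overwrite").parquet("/data/output")
```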
Stages
Spark will create a stage for each dataset that is read
All consecutive narrow transformations (e.g. FILTER, MAP) will be grouped together inside the stage
Spark will create a new stage whenever it encounters a wide transformation (e.g. JOIN, reduceByKey), because a wide transformation requires a shuffle
For the above set of instructions, Spark will create 3 stages –
First stage – Instructions 1, 2 and 3
Second stage – Instructions 4 and 5
Third stage – Instructions 6, 7 and 8
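One way to check this breakdown yourself (a small sketch, reusing the result DataFrame from the code above) is to print the physical plan before running the job: Exchange nodes in the plan are shuffles, which is exactly where Spark cuts a new stage. After the job runs, the Spark UI’s Stages tab shows the same breakdown.

```scala
// Each Exchange node in the physical plan is a shuffle, i.e. a stage
// boundary. The exact plan depends on the Spark version and on whether
// the optimizer picks a shuffle join or a broadcast join.
result.explain()
```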
Tasks
Spark creates a task to execute a set of instructions inside a stage.
The number of tasks in a stage equals the number of partitions of the dataset it processes.
A task executes all consecutive narrow transformations inside a stage – this is called pipelining.
Tasks in the first stage will execute instructions 1, 2 and 3
Tasks in the second stage will execute instructions 4 and 5
Tasks in the third stage will execute instructions 6, 7 and 8
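As a rough sketch of how the task counts line up with partitions (reusing mappedX, mappedY and spark from the code above; the exact numbers depend on your input file splits and configuration):

```scala
// Tasks in the first stage = partitions of dataset_X after reading
println(s"dataset_X partitions: ${mappedX.rdd.getNumPartitions}")

// Tasks in the second stage = partitions of dataset_Y after reading
println(s"dataset_Y partitions: ${mappedY.rdd.getNumPartitions}")

// Tasks in the third (post-shuffle) stage are governed by
// spark.sql.shuffle.partitions (200 by default), unless adaptive
// query execution coalesces them at runtime.
val shufflePartitions = spark.conf.get("spark.sql.shuffle.partitions")
println(s"shuffle partitions: $shufflePartitions")
```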