The stack function in Spark takes the number of rows to generate as its first argument, followed by a list of expressions:
stack(n, expr1, expr2, ..., exprk)
The stack function generates n rows by evaluating the expressions: the k expressions are separated into n rows, so each row gets k/n columns. By default the output columns are named col0, col1, and so on, unless aliased.
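Here is a minimal sketch of that behavior, run over constant literals through Spark SQL (assuming a SparkSession named spark is in scope):

// Four expressions split into 2 rows give 4 / 2 = 2 columns per row,
// named col0 and col1 by default.
spark.sql("SELECT stack(2, 1, 2, 3, 4)").show()

+----+----+
|col0|col1|
+----+----+
|   1|   2|
|   3|   4|
+----+----+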
stack() in action
Let’s see the stack function in action. stack() comes in handy when we attempt to unpivot a DataFrame.
Let’s say our data is pivoted and it looks like below.
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
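For reference, this pivoted shape is produced with Spark’s groupBy/pivot, assuming df is the long-form DataFrame built in the full code at the end of this post:

// One column per distinct Project value; sum is the aggregation.
val pivotDF = df.groupBy("Name").pivot("Project").sum("Cost_To_Project")
pivotDF.show()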
We would like to unpivot the data and make it look like below.
+-------+---------+---------------+
|   Name|  Project|Cost_To_Project|
+-------+---------+---------------+
| Mickey|       BI|          12000|
| Mickey|       ML|           8000|
| Martin|       BI|           5000|
|  Jerry|Ingestion|           1000|
|  Riley|       ML|           9000|
| Donald|Analytics|           1000|
|   John|Ingestion|           1000|
|Patrick|       ML|           1000|
|  Emily|Analytics|           8000|
|  Emily|Ingestion|           3000|
|   Arya|Analytics|          10000|
|   Arya|Ingestion|           2000|
+-------+---------+---------------+
We can use the stack function like below to get the job done.
- For every record in pivotDF, we select Name and call the stack function
- The stack function creates 4 rows for every row in pivotDF, with 2 columns: Project and Cost_To_Project
- 4 rows because we have 4 unique projects in our dataset
- The Project value is a hard-coded literal in the stack expression (e.g. 'Analytics')
- The Cost_To_Project value is read from the column of the same name in the dataset (e.g. the Analytics column)
pivotDF.select($"Name", expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)")).show(false) +------+---------+---------------+ |Name |Project |Cost_To_Project| +------+---------+---------------+ |Mickey|Analytics|null | |Mickey|BI |12000 | |Mickey|Ingestion|null | |Mickey|ML |8000 | |Martin|Analytics|null | |Martin|BI |5000 | |Martin|Ingestion|null | |Martin|ML |null | |Jerry |Analytics|null | |Jerry |BI |null | |Jerry |Ingestion|1000 | |Jerry |ML |null | |Riley |Analytics|null | |Riley |BI |null | |Riley |Ingestion|null | |Riley |ML |9000 | |Donald|Analytics|1000 | ---- ----
Finally, we can filter the DataFrame to drop the records where Cost_To_Project is NULL.
val unPivotDF = pivotDF.select(
  $"Name",
  expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)")
).where("Cost_To_Project is not null")

unPivotDF.show()

+-------+---------+---------------+
|   Name|  Project|Cost_To_Project|
+-------+---------+---------------+
| Mickey|       BI|          12000|
| Mickey|       ML|           8000|
| Martin|       BI|           5000|
|  Jerry|Ingestion|           1000|
|  Riley|       ML|           9000|
| Donald|Analytics|           1000|
|   John|Ingestion|           1000|
|Patrick|       ML|           1000|
|  Emily|Analytics|           8000|
|  Emily|Ingestion|           3000|
|   Arya|Analytics|          10000|
|   Arya|Ingestion|           2000|
+-------+---------+---------------+
Full code
import org.apache.spark.sql.functions.expr
import spark.sqlContext.implicits._

val data = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000),
  ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000),
  ("ML", "Patrick", 1000),
  ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000),
  ("Ingestion", "John", 1000),
  ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000),
  ("BI", "Mickey", 12000),
  ("BI", "Martin", 5000))

val df = data.toDF("Project", "Name", "Cost_To_Project")

// pivot
val pivotDF = df.groupBy("Name").pivot("Project").sum("Cost_To_Project")
pivotDF.show()

// unpivot
val unPivotDF = pivotDF.select(
  $"Name",
  expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)")
).where("Cost_To_Project is not null")
unPivotDF.show()
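One wrinkle worth noting: the stack expression above hard-codes the four project names. As a sketch that goes slightly beyond the snippet above, you could build the expression from pivotDF's own columns, so the unpivot keeps working if the set of projects changes (unPivotDynamicDF is just an illustrative name):

// Derive the stack arguments from the pivoted column names.
val projectCols = pivotDF.columns.filter(_ != "Name")
val stackArgs = projectCols.map(c => s"'$c', `$c`").mkString(", ")

val unPivotDynamicDF = pivotDF.select(
  $"Name",
  expr(s"stack(${projectCols.length}, $stackArgs) as (Project, Cost_To_Project)")
).where("Cost_To_Project is not null")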