In this post we are going to describe how to pivot and unpivot a DataFrame in Spark.
We have an employee DataFrame with three columns: Project, Name and Cost_To_Project. An employee can belong to multiple projects, and for each project a Cost_To_Project is assigned.
```scala
import spark.sqlContext.implicits._

val data = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000),
  ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000),
  ("ML", "Patrick", 1000),
  ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000),
  ("Ingestion", "John", 1000),
  ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000),
  ("BI", "Mickey", 12000),
  ("BI", "Martin", 5000)
)

val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()
```

```
+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|  Jerry|           1000|
|Ingestion|   Arya|           2000|
|Ingestion|  Emily|           3000|
|       ML|  Riley|           9000|
|       ML|Patrick|           1000|
|       ML| Mickey|           8000|
|Analytics| Donald|           1000|
|Ingestion|   John|           1000|
|Analytics|  Emily|           8000|
|Analytics|   Arya|          10000|
|       BI| Mickey|          12000|
|       BI| Martin|           5000|
+---------+-------+---------------+
```
Our goal is to pivot this DataFrame into:
```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
Pivot a DataFrame
Note that in the above example we are turning the Project values into columns; in other words, we are pivoting the data by Project.
Spark comes with a built-in pivot function. First we group the data by Name, then pivot on the Project column, and finally apply sum on Cost_To_Project. Here the sum doesn't do much, since there is only one row per employee and project combination.
```scala
val pivotDF = df.groupBy("Name").pivot("Project").sum("Cost_To_Project")
pivotDF.show()
```

```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
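To see the sum aggregation actually doing work, consider a hypothetical variation (not in our dataset) where an employee has two cost entries for the same project; the pivot then collapses them into one cell:

```scala
import spark.sqlContext.implicits._

// Hypothetical data: Jerry has two cost entries for the Ingestion project.
val dupData = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Jerry", 500)
)
val dupDF = dupData.toDF("Project", "Name", "Cost_To_Project")

// sum aggregates both entries into a single pivoted cell: 1000 + 500 = 1500.
val dupPivot = dupDF.groupBy("Name").pivot("Project").sum("Cost_To_Project")
dupPivot.show()
// +-----+---------+
// | Name|Ingestion|
// +-----+---------+
// |Jerry|     1500|
// +-----+---------+
```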
The pivot function in Spark also takes an optional second argument: the list of values of the pivot column. If we know the distinct values of the column we are pivoting on, we can supply them as shown below. This is also more efficient, because Spark can skip the extra job it would otherwise run to compute the distinct values itself.
```scala
val projects = Seq("Analytics", "BI", "Ingestion", "ML")
val pivotDF = df.groupBy("Name").pivot("Project", projects).sum("Cost_To_Project")
pivotDF.show()
```

```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
Unpivot a DataFrame
Spark doesn’t come with an unpivot function out of the box. But we can achieve the same result with the stack function, then filter the DataFrame to keep only the records where Cost_To_Project is not null.
Check out this post if you need help understanding the stack function.
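As a quick refresher, `stack(n, v1, v2, ...)` spreads its arguments across n rows. A one-liner in spark-shell shows the idea:

```scala
// stack(2, 'a', 1, 'b', 2) turns one row into two rows: ('a', 1) and ('b', 2).
spark.sql("SELECT stack(2, 'a', 1, 'b', 2) AS (key, value)").show()
// +---+-----+
// |key|value|
// +---+-----+
// |  a|    1|
// |  b|    2|
// +---+-----+
```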
```scala
import org.apache.spark.sql.functions.expr

val unPivotDF = pivotDF
  .select($"Name", expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)"))
  .where("Cost_To_Project is not null")
unPivotDF.show()
```

```
+-------+---------+---------------+
|   Name|  Project|Cost_To_Project|
+-------+---------+---------------+
| Mickey|       BI|          12000|
| Mickey|       ML|           8000|
| Martin|       BI|           5000|
|  Jerry|Ingestion|           1000|
|  Riley|       ML|           9000|
| Donald|Analytics|           1000|
|   John|Ingestion|           1000|
|Patrick|       ML|           1000|
|  Emily|Analytics|           8000|
|  Emily|Ingestion|           3000|
|   Arya|Analytics|          10000|
|   Arya|Ingestion|           2000|
+-------+---------+---------------+
```
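As a sanity check, unpivoting the pivoted data should give back exactly the original rows. A quick sketch of that round-trip verification (note the cast: sum produces a Long, while the original Cost_To_Project is an Int, and except needs matching column types):

```scala
// Align column order and types, then check that the two sides contain
// exactly the same rows (except in both directions yields empty results).
val original  = df.select($"Name", $"Project", $"Cost_To_Project".cast("long"))
val roundTrip = unPivotDF.select($"Name", $"Project", $"Cost_To_Project")

assert(original.except(roundTrip).count() == 0)
assert(roundTrip.except(original).count() == 0)
```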