How to pivot and unpivot a DataFrame in Spark?


In this post, we are going to describe how to pivot and unpivot a DataFrame in Spark.

We have an employee DataFrame with three columns: Name, Project and Cost_To_Project. An employee can belong to multiple projects, and a Cost_To_Project is assigned for each project.

 val data = Seq(
       ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
       ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
       ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
       ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))
 import spark.implicits._
 val df = data.toDF("Project", "Name", "Cost_To_Project")
 df.show()
 scala> df.show()
 +---------+-------+---------------+
 |  Project|   Name|Cost_To_Project|
 +---------+-------+---------------+
 |Ingestion|  Jerry|           1000|
 |Ingestion|   Arya|           2000|
 |Ingestion|  Emily|           3000|
 |       ML|  Riley|           9000|
 |       ML|Patrick|           1000|
 |       ML| Mickey|           8000|
 |Analytics| Donald|           1000|
 |Ingestion|   John|           1000|
 |Analytics|  Emily|           8000|
 |Analytics|   Arya|          10000|
 |       BI| Mickey|          12000|
 |       BI| Martin|           5000|
 +---------+-------+---------------+ 

Our goal is to turn, or pivot, this DataFrame into the following:

 +-------+---------+-----+---------+----+
 |   Name|Analytics|   BI|Ingestion|  ML|
 +-------+---------+-----+---------+----+
 | Mickey|     null|12000|     null|8000|
 | Martin|     null| 5000|     null|null|
 |  Jerry|     null| null|     1000|null|
 |  Riley|     null| null|     null|9000|
 | Donald|     1000| null|     null|null|
 |   John|     null| null|     1000|null|
 |Patrick|     null| null|     null|1000|
 |  Emily|     8000| null|     3000|null|
 |   Arya|    10000| null|     2000|null|
 +-------+---------+-----+---------+----+ 

Pivot a DataFrame

Note that in the above example we are turning the Project values into columns; in other words, we are pivoting the data by Project.

Spark comes with a pivot function. First we group the data by Name, then pivot on the Project column and apply a sum on Cost_To_Project. Here the sum doesn't do much, as we have only one assignment per employee and project combination.

 val pivotDF = df.groupBy("Name").pivot("Project").sum("Cost_To_Project")
 pivotDF.show()
 scala> pivotDF.show()
 +-------+---------+-----+---------+----+
 |   Name|Analytics|   BI|Ingestion|  ML|
 +-------+---------+-----+---------+----+
 | Mickey|     null|12000|     null|8000|
 | Martin|     null| 5000|     null|null|
 |  Jerry|     null| null|     1000|null|
 |  Riley|     null| null|     null|9000|
 | Donald|     1000| null|     null|null|
 |   John|     null| null|     1000|null|
 |Patrick|     null| null|     null|1000|
 |  Emily|     8000| null|     3000|null|
 |   Arya|    10000| null|     2000|null|
 +-------+---------+-----+---------+----+ 

The pivot function in Spark also takes an optional list of values. If we know the unique list of values for the column we are pivoting on, we can supply it as the second argument, like below. Doing so also helps performance: Spark can skip the extra job it would otherwise run to compute the distinct values of the pivot column.

 val projects = Seq("Analytics","BI","Ingestion","ML")
 val pivotDF = df.groupBy("Name").pivot("Project", projects).sum("Cost_To_Project")
 pivotDF.show()
 scala> pivotDF.show()
 +-------+---------+-----+---------+----+
 |   Name|Analytics|   BI|Ingestion|  ML|
 +-------+---------+-----+---------+----+
 | Mickey|     null|12000|     null|8000|
 | Martin|     null| 5000|     null|null|
 |  Jerry|     null| null|     1000|null|
 |  Riley|     null| null|     null|9000|
 | Donald|     1000| null|     null|null|
 |   John|     null| null|     1000|null|
 |Patrick|     null| null|     null|1000|
 |  Emily|     8000| null|     3000|null|
 |   Arya|    10000| null|     2000|null|
 +-------+---------+-----+---------+----+ 
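The list of values also doesn't have to be exhaustive. Passing only a subset of the projects produces columns for just those values, while every employee still gets a row. Here is a quick sketch of our own (subsetDF is a name we made up, not part of the original example):

 // Pivot on a subset of the Project values (our own illustration).
 // Only Analytics and BI become columns; costs for the other
 // projects simply don't appear in the output.
 val subsetDF = df.groupBy("Name").pivot("Project", Seq("Analytics", "BI")).sum("Cost_To_Project")
 subsetDF.show()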

Unpivot a DataFrame

Spark doesn’t come with an unpivot function out of the box, but we can achieve the same functionality with the stack function, filtering the DataFrame to keep only the records where Cost_To_Project is not null.

Check out this post if you need help understanding the stack function.
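As a quick refresher before we use it, stack(n, expr1, expr2, …) spreads its arguments across n rows. Here is a tiny standalone sketch of our own, not part of the original example:

 // stack(2, 'a', 1, 'b', 2) yields two rows: ('a', 1) and ('b', 2).
 spark.range(1).selectExpr("stack(2, 'a', 1, 'b', 2) as (key, value)").show()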

 import org.apache.spark.sql.functions.expr

 val unPivotDF = pivotDF
   .select($"Name",
     expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)"))
   .where("Cost_To_Project is not null")
 unPivotDF.show()
 +-------+---------+---------------+
 |   Name|  Project|Cost_To_Project|
 +-------+---------+---------------+
 | Mickey|       BI|          12000|
 | Mickey|       ML|           8000|
 | Martin|       BI|           5000|
 |  Jerry|Ingestion|           1000|
 |  Riley|       ML|           9000|
 | Donald|Analytics|           1000|
 |   John|Ingestion|           1000|
 |Patrick|       ML|           1000|
 |  Emily|Analytics|           8000|
 |  Emily|Ingestion|           3000|
 |   Arya|Analytics|          10000|
 |   Arya|Ingestion|           2000|
 +-------+---------+---------------+ 
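To sanity-check the round trip, we can compare the unpivoted result against the original DataFrame with except; both directions should come back empty. This check is our own addition, not part of the original example (note that the sum() in the pivot widened Cost_To_Project to a bigint, but Spark widens the types automatically for set operations).

 // Both queries should return no rows if the unpivot reproduced df exactly.
 df.select("Project", "Name", "Cost_To_Project")
   .except(unPivotDF.select("Project", "Name", "Cost_To_Project"))
   .show()
 unPivotDF.select("Project", "Name", "Cost_To_Project")
   .except(df.select("Project", "Name", "Cost_To_Project"))
   .show()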