In this post we are going to describe how to pivot and unpivot a DataFrame in Spark.
We have an employee DataFrame with three columns: Project, Name and Cost_To_Project. An employee can belong to multiple projects, and for each project a Cost_To_Project is assigned.
```scala
import spark.sqlContext.implicits._

val data = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000),
  ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000),
  ("ML", "Patrick", 1000),
  ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000),
  ("Ingestion", "John", 1000),
  ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000),
  ("BI", "Mickey", 12000),
  ("BI", "Martin", 5000)
)

val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()
```

```
+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|  Jerry|           1000|
|Ingestion|   Arya|           2000|
|Ingestion|  Emily|           3000|
|       ML|  Riley|           9000|
|       ML|Patrick|           1000|
|       ML| Mickey|           8000|
|Analytics| Donald|           1000|
|Ingestion|   John|           1000|
|Analytics|  Emily|           8000|
|Analytics|   Arya|          10000|
|       BI| Mickey|          12000|
|       BI| Martin|           5000|
+---------+-------+---------------+
```
Our goal is to pivot this DataFrame into:
```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
Pivot a DataFrame
Note that in the above example we are turning the Project values into columns; in other words, we are pivoting the data by Project.
Spark comes with a built-in pivot function. First we group the data by Name, then pivot on the Project column, and finally apply sum on Cost_To_Project. Here the sum doesn't do much, since there is only one row per employee and project combination.
```scala
val pivotDF = df.groupBy("Name").pivot("Project").sum("Cost_To_Project")
pivotDF.show()
```

```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
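To see the sum aggregation actually doing work, consider a hypothetical variation (not in our dataset) where an employee has two cost entries for the same project; the pivot then collapses them into one cell:

```scala
import spark.sqlContext.implicits._

// Hypothetical data: Jerry has two cost entries for the Ingestion project.
val dupData = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Jerry", 500)
)
val dupDF = dupData.toDF("Project", "Name", "Cost_To_Project")

// sum aggregates both entries into a single pivoted cell: 1000 + 500 = 1500.
val dupPivot = dupDF.groupBy("Name").pivot("Project").sum("Cost_To_Project")
dupPivot.show()
// +-----+---------+
// | Name|Ingestion|
// +-----+---------+
// |Jerry|     1500|
// +-----+---------+
```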
The pivot function in Spark also takes an optional second argument: the list of values of the pivot column. If we know the distinct values of the column we are pivoting on, we can supply them as shown below. This is also more efficient, because Spark can skip the extra job it would otherwise run to compute the distinct values itself.
```scala
val projects = Seq("Analytics", "BI", "Ingestion", "ML")
val pivotDF = df.groupBy("Name").pivot("Project", projects).sum("Cost_To_Project")
pivotDF.show()
```

```
+-------+---------+-----+---------+----+
|   Name|Analytics|   BI|Ingestion|  ML|
+-------+---------+-----+---------+----+
| Mickey|     null|12000|     null|8000|
| Martin|     null| 5000|     null|null|
|  Jerry|     null| null|     1000|null|
|  Riley|     null| null|     null|9000|
| Donald|     1000| null|     null|null|
|   John|     null| null|     1000|null|
|Patrick|     null| null|     null|1000|
|  Emily|     8000| null|     3000|null|
|   Arya|    10000| null|     2000|null|
+-------+---------+-----+---------+----+
```
Unpivot a DataFrame
Spark doesn’t come with an unpivot function out of the box. But we can achieve the same result with the stack function, then filter the DataFrame to keep only the records where Cost_To_Project is not null.
Check out this post if you need help understanding the stack function.
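As a quick refresher, `stack(n, v1, v2, ...)` spreads its arguments across n rows. A one-liner in spark-shell shows the idea:

```scala
// stack(2, 'a', 1, 'b', 2) turns one row into two rows: ('a', 1) and ('b', 2).
spark.sql("SELECT stack(2, 'a', 1, 'b', 2) AS (key, value)").show()
// +---+-----+
// |key|value|
// +---+-----+
// |  a|    1|
// |  b|    2|
// +---+-----+
```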
```scala
import org.apache.spark.sql.functions.expr

val unPivotDF = pivotDF
  .select($"Name", expr("stack(4, 'Analytics', Analytics, 'BI', BI, 'Ingestion', Ingestion, 'ML', ML) as (Project, Cost_To_Project)"))
  .where("Cost_To_Project is not null")
unPivotDF.show()
```

```
+-------+---------+---------------+
|   Name|  Project|Cost_To_Project|
+-------+---------+---------------+
| Mickey|       BI|          12000|
| Mickey|       ML|           8000|
| Martin|       BI|           5000|
|  Jerry|Ingestion|           1000|
|  Riley|       ML|           9000|
| Donald|Analytics|           1000|
|   John|Ingestion|           1000|
|Patrick|       ML|           1000|
|  Emily|Analytics|           8000|
|  Emily|Ingestion|           3000|
|   Arya|Analytics|          10000|
|   Arya|Ingestion|           2000|
+-------+---------+---------------+
```
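As a sanity check, unpivoting the pivoted data should give back exactly the original rows. A quick sketch of that round-trip verification (note the cast: sum produces a Long, while the original Cost_To_Project is an Int, and except needs matching column types):

```scala
// Align column order and types, then check that the two sides contain
// exactly the same rows (except in both directions yields empty results).
val original  = df.select($"Name", $"Project", $"Cost_To_Project".cast("long"))
val roundTrip = unPivotDF.select($"Name", $"Project", $"Cost_To_Project")

assert(original.except(roundTrip).count() == 0)
assert(roundTrip.except(original).count() == 0)
```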