How to properly remove or decommission a node from an Elasticsearch cluster?
April 24, 2023Getting started with Snowflake in less than 10 minutes
May 1, 2023Unfortunately there is no built in function to transpose a DataFrame from columns to rows in Spark. In this post we will show an easy way to transpose a DataFrame from columns to rows in Spark.
Here is our data.
val data = Seq( ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000), ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000), ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000), ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000)) import spark.sqlContext.implicits._ val df = data.toDF("Project", "Name", "Cost_To_Project") df.show() +---------+-------+---------------+ | Project| Name|Cost_To_Project| +---------+-------+---------------+ |Ingestion| Jerry| 1000| |Ingestion| Arya| 2000| |Ingestion| Emily| 3000| | ML| Riley| 9000| | ML|Patrick| 1000| | ML| Mickey| 8000| |Analytics| Donald| 1000| |Ingestion| John| 1000| |Analytics| Emily| 8000| |Analytics| Arya| 10000| | BI| Mickey| 12000| | BI| Martin| 5000| +---------+-------+---------------+
We want to transpose this data below. As you can see the columns are now turned into rows.
+---------------+---------+ |key |value | +---------------+---------+ |Project |Ingestion| |Name |Jerry | |Cost_To_Project|1000 | |Project |Ingestion| |Name |Arya | |Cost_To_Project|2000 | |Project |Ingestion| |Name |Emily | |Cost_To_Project|3000 | |Project |ML | |Name |Riley | |Cost_To_Project|9000 | |Project |ML | |Name |Patrick | |Cost_To_Project|1000 | |Project |ML | |Name |Mickey | |Cost_To_Project|8000 | |Project |Analytics| |Name |Donald | +---------------+---------+
Create a map of all columns
For each row in the DataFrame we will create a map of each column in the row with column name as the key and value of the column as value.
var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) } scala> df.withColumn("myMap", map(colAndVals:_*)).show(false) +---------+-------+---------------+---------------------------------------------------------------+ |Project |Name |Cost_To_Project|myMap | +---------+-------+---------------+---------------------------------------------------------------+ |Ingestion|Jerry |1000 |[Project -> Ingestion, Name -> Jerry, Cost_To_Project -> 1000] | |Ingestion|Arya |2000 |[Project -> Ingestion, Name -> Arya, Cost_To_Project -> 2000] | |Ingestion|Emily |3000 |[Project -> Ingestion, Name -> Emily, Cost_To_Project -> 3000] | |ML |Riley |9000 |[Project -> ML, Name -> Riley, Cost_To_Project -> 9000] | |ML |Patrick|1000 |[Project -> ML, Name -> Patrick, Cost_To_Project -> 1000] | |ML |Mickey |8000 |[Project -> ML, Name -> Mickey, Cost_To_Project -> 8000] | |Analytics|Donald |1000 |[Project -> Analytics, Name -> Donald, Cost_To_Project -> 1000]| |Ingestion|John |1000 |[Project -> Ingestion, Name -> John, Cost_To_Project -> 1000] | |Analytics|Emily |8000 |[Project -> Analytics, Name -> Emily, Cost_To_Project -> 8000] | |Analytics|Arya |10000 |[Project -> Analytics, Name -> Arya, Cost_To_Project -> 10000] | |BI |Mickey |12000 |[Project -> BI, Name -> Mickey, Cost_To_Project -> 12000] | |BI |Martin |5000 |[Project -> BI, Name -> Martin, Cost_To_Project -> 5000] | +---------+-------+---------------+---------------------------------------------------------------+
Explode the map
Explode the map once a map is created for each row.
df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).show(false) +---------------+---------+ |key |value | +---------------+---------+ |Project |Ingestion| |Name |Jerry | |Cost_To_Project|1000 | |Project |Ingestion| |Name |Arya | |Cost_To_Project|2000 | |Project |Ingestion| |Name |Emily | |Cost_To_Project|3000 | |Project |ML | |Name |Riley | |Cost_To_Project|9000 | |Project |ML | |Name |Patrick | |Cost_To_Project|1000 | |Project |ML | |Name |Mickey | |Cost_To_Project|8000 | |Project |Analytics| |Name |Donald | +---------------+---------+
Full code
val data = Seq( ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000), ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000), ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000), ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000)) import spark.sqlContext.implicits._ val df = data.toDF("Project", "Name", "Cost_To_Project") df.show() var colAndVals = df.columns.flatMap { c => Array(lit(c), col(c)) } df.withColumn("myMap", map(colAndVals:_*)).select(explode($"myMap")).toDF("Columns","Values").show(false)
1 Comment
[…] The Big Data in Real World team makes a pivot: […]