How to find the number of partitions of a DataFrame in Spark?
Let's say we have a DataFrame with the employee name, project and the cost of the employee to the project.
From this data, we build a second DataFrame grouped by Name, with Cost_To_Project summed, and we want to find out how many partitions this groupedDF DataFrame has.
Here is the code.
// Sample data: (project, employee name, cost of the employee to the project)
val data = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000),
  ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000),
  ("ML", "Patrick", 1000),
  ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000),
  ("Ingestion", "John", 1000),
  ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000),
  ("BI", "Mickey", 12000),
  ("BI", "Martin", 5000)
)

import spark.sqlContext.implicits._

val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()

// groupBy triggers a shuffle, which determines the partitioning of the result
val groupedDF = df.groupBy("Name").sum("Cost_To_Project")
Finding the number of partitions
Simply convert the DataFrame to an RDD with .rdd, then call partitions and take its size to get the number of partitions.
groupedDF.rdd.partitions.size
// Int = 200
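An equivalent, slightly more direct call is getNumPartitions on the RDD, available since Spark 1.6; a minimal sketch using the same groupedDF:

// getNumPartitions returns the same value as partitions.size
groupedDF.rdd.getNumPartitions
// Int = 200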
The number of partitions comes back as 200. Why is that?

The spark.sql.shuffle.partitions property decides how many partitions are used whenever data has to be shuffled for operations such as joins, group-bys, and other aggregations. Its default value is 200, which is why groupedDF ends up with 200 partitions.
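If a different number of shuffle partitions is wanted, the property can be changed on the session before the aggregation runs. A minimal sketch, reusing the df defined above; the name regroupedDF and the value 8 are illustrative, not from the original post:

// Lower the shuffle partition count; this affects subsequent shuffles,
// not DataFrames that have already been computed
spark.conf.set("spark.sql.shuffle.partitions", "8")

val regroupedDF = df.groupBy("Name").sum("Cost_To_Project")
println(regroupedDF.rdd.partitions.size) // 8

One caveat: on Spark 3.2 and later, adaptive query execution is enabled by default and may coalesce small shuffle partitions, so the observed count can be lower than the configured value.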