How to find the number of partitions in a DataFrame? - Big Data In Real World

How to find the number of partitions in a DataFrame?

Let’s say we have a DataFrame with the employee name, project and the cost of the employee to the project.

From this data, we create a second DataFrame, groupedDF, by grouping on Name and summing Cost_To_Project. We want to find out the number of partitions in groupedDF.

Here is the code.

 // Implicit conversions needed for toDF (spark is the session provided by spark-shell)
 import spark.implicits._

 // Sample rows: (Project, Name, Cost_To_Project)
 val data = Seq(
        ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
        ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
        ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
        ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))
 val df = data.toDF("Project", "Name", "Cost_To_Project")
 df.show()

 // Total cost per employee; the groupBy requires a shuffle
 val groupedDF = df.groupBy("Name").sum("Cost_To_Project")

Finding the number of partitions

Convert the DataFrame to an RDD with rdd, then call partitions followed by size to get the number of partitions. Calling getNumPartitions on the RDD returns the same value.

 groupedDF.rdd.partitions.size
  Int = 200 

We would see the number of partitions as 200. Now why is that?

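The spark.sql.shuffle.partitions property sets the number of partitions Spark uses when an operation requires shuffling data, such as a join, group by or other aggregation. Its default value is 200, and because groupBy triggers a shuffle, groupedDF ends up with 200 partitions. Note that on Spark versions where adaptive query execution is enabled by default (3.2 and later), small shuffle partitions may be coalesced, so you may see a smaller number than 200.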
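To see the property at work, here is a minimal sketch that changes spark.sql.shuffle.partitions and reads the partition count back. The object name ShufflePartitionsDemo, the helper groupedPartitionCount, the local[2] master and the shortened sample data are assumptions made for illustration and are not part of the original example. Adaptive query execution is switched off here only so that the count is not coalesced on newer Spark versions.

```scala
import org.apache.spark.sql.SparkSession

object ShufflePartitionsDemo {

  // Hypothetical helper: groups a small DataFrame by Name and returns the
  // number of partitions of the result for the given shuffle setting.
  def groupedPartitionCount(spark: SparkSession, shufflePartitions: Int): Int = {
    import spark.implicits._

    // Assumption: turn off adaptive query execution so that small shuffle
    // partitions are not merged and the configured value is observable.
    spark.conf.set("spark.sql.adaptive.enabled", "false")
    spark.conf.set("spark.sql.shuffle.partitions", shufflePartitions.toString)

    // Shortened version of the sample data used in this post
    val df = Seq(
      ("Ingestion", "Jerry", 1000),
      ("ML", "Riley", 9000),
      ("BI", "Mickey", 12000)
    ).toDF("Project", "Name", "Cost_To_Project")

    // groupBy forces a shuffle, so the result is partitioned according to
    // spark.sql.shuffle.partitions
    df.groupBy("Name").sum("Cost_To_Project").rdd.getNumPartitions
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("shuffle-partitions-demo")
      .getOrCreate()

    println(groupedPartitionCount(spark, 8))
    spark.stop()
  }
}
```

Under these assumptions the printed count should match the value passed in (8 here) rather than the default of 200. For small datasets like this one, a lower setting is usually a reasonable choice because most of the 200 default partitions would otherwise be empty.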

Big Data In Real World
