How to find the number of partitions of a DataFrame in Spark?
Let's say we have a DataFrame with the employee name, project and the cost of the employee to the project.
From this data, we build a second DataFrame grouped by Name, with Cost_To_Project summed, and we want to find out how many partitions this groupedDF DataFrame has.
Here is the code.
// Sample data: (project, employee name, cost of the employee to the project)
val data = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000),
  ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000),
  ("ML", "Patrick", 1000),
  ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000),
  ("Ingestion", "John", 1000),
  ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000),
  ("BI", "Mickey", 12000),
  ("BI", "Martin", 5000)
)

import spark.sqlContext.implicits._

val df = data.toDF("Project", "Name", "Cost_To_Project")
df.show()

// groupBy triggers a shuffle, which determines the partitioning of the result
val groupedDF = df.groupBy("Name").sum("Cost_To_Project")
Finding the number of partitions
Simply convert the DataFrame to an RDD with .rdd, then call partitions and take its size to get the number of partitions.
groupedDF.rdd.partitions.size
// Int = 200
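An equivalent, slightly more direct call is getNumPartitions on the RDD, available since Spark 1.6; a minimal sketch using the same groupedDF:

// getNumPartitions returns the same value as partitions.size
groupedDF.rdd.getNumPartitions
// Int = 200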
The number of partitions comes back as 200. Why is that?

The spark.sql.shuffle.partitions property decides how many partitions are used whenever data has to be shuffled for operations such as joins, group-bys, and other aggregations. Its default value is 200, which is why groupedDF ends up with 200 partitions.
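If a different number of shuffle partitions is wanted, the property can be changed on the session before the aggregation runs. A minimal sketch, reusing the df defined above; the name regroupedDF and the value 8 are illustrative, not from the original post:

// Lower the shuffle partition count; this affects subsequent shuffles,
// not DataFrames that have already been computed
spark.conf.set("spark.sql.shuffle.partitions", "8")

val regroupedDF = df.groupBy("Name").sum("Cost_To_Project")
println(regroupedDF.rdd.partitions.size) // 8

One caveat: on Spark 3.2 and later, adaptive query execution is enabled by default and may coalesce small shuffle partitions, so the observed count can be lower than the configured value.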