How to add total count of DataFrame to an already grouped DataFrame? - Big Data In Real World

How to add total count of DataFrame to an already grouped DataFrame?

How to query data from Snowflake in Spark?
May 15, 2023
How to list topics without accessing Zookeeper in Kafka?
May 29, 2023
How to query data from Snowflake in Spark?
May 15, 2023
How to list topics without accessing Zookeeper in Kafka?
May 29, 2023

Here is our data. We have an employee DataFrame with 3 columns, name, project and cost_to_project. An employee can belong to multiple projects and for each project a cost_to_project is assigned.

val data = Seq(
      ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
      ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
      ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
      ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000))

import spark.sqlContext.implicits._

val df = data.toDF("Project", "Name", "Cost_To_Project")


scala> df.show()
+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|  Jerry|           1000|
|Ingestion|   Arya|           2000|
|Ingestion|  Emily|           3000|
|       ML|  Riley|           9000|
|       ML|Patrick|           1000|
|       ML| Mickey|           8000|
|Analytics| Donald|           1000|
|Ingestion|   John|           1000|
|Analytics|  Emily|           8000|
|Analytics|   Arya|          10000|
|       BI| Mickey|          12000|
|       BI| Martin|           5000|
+---------+-------+---------------+

We want to group the dataset by Name and get a count to see the employee and the number of projects they are assigned to. In addition to that sub count, we also want to add a column with a total count like below.

+-------+------------------+-----------+
|   Name|number_of_projects|Total Count|
+-------+------------------+-----------+
| Mickey|                 2|         12|
| Martin|                 1|         12|
|  Jerry|                 1|         12|
|  Riley|                 1|         12|
| Donald|                 1|         12|
|   John|                 1|         12|
|Patrick|                 1|         12|
|  Emily|                 2|         12|
|   Arya|                 2|         12|
+-------+------------------+-----------+

Solution

It is pretty simple to achieve this. Simply add a column with counting the dataFrame and convert the value to a literal.

val groupBy = df.groupBy("Name").agg(count("*").alias("number_of_projects")).withColumn("Total Count", lit(df.count))
groupBy.show()

+-------+------------------+-----------+
|   Name|number_of_projects|Total Count|
+-------+------------------+-----------+
| Mickey|                 2|         12|
| Martin|                 1|         12|
|  Jerry|                 1|         12|
|  Riley|                 1|         12|
| Donald|                 1|         12|
|   John|                 1|         12|
|Patrick|                 1|         12|
|  Emily|                 2|         12|
|   Arya|                 2|         12|
+-------+------------------+-----------+
Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

1 Comment

  1. […] The Big Data in Real World team does some counting: […]

How to add total count of DataFrame to an already grouped DataFrame?
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X