{"id":2284,"date":"2023-05-22T06:00:00","date_gmt":"2023-05-22T11:00:00","guid":{"rendered":"https:\/\/www.bigdatainrealworld.com\/?p=2284"},"modified":"2023-05-09T07:24:24","modified_gmt":"2023-05-09T12:24:24","slug":"how-to-add-total-count-of-dataframe-to-an-already-grouped-dataframe","status":"publish","type":"post","link":"https:\/\/www.bigdatainrealworld.com\/how-to-add-total-count-of-dataframe-to-an-already-grouped-dataframe\/","title":{"rendered":"How to add total count of DataFrame to an already grouped DataFrame?"},"content":{"rendered":"\n<p><\/p>\n\n\n\n<p>Here is our data. We have an employee DataFrame with 3 columns, name, project and cost_to_project. An employee can belong to multiple projects and for each project a cost_to_project is assigned.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">val data = Seq(\r\n      (\"Ingestion\", \"Jerry\", 1000), (\"Ingestion\", \"Arya\", 2000), (\"Ingestion\", \"Emily\", 3000),\r\n      (\"ML\", \"Riley\", 9000), (\"ML\", \"Patrick\", 1000), (\"ML\", \"Mickey\", 8000),\r\n      (\"Analytics\", \"Donald\", 1000), (\"Ingestion\", \"John\", 1000), (\"Analytics\", \"Emily\", 8000),\r\n      (\"Analytics\", \"Arya\", 10000), (\"BI\", \"Mickey\", 12000), (\"BI\", \"Martin\", 5000))\r\n\r\nimport spark.sqlContext.implicits._\r\n\r\nval df = data.toDF(\"Project\", \"Name\", \"Cost_To_Project\")\r\n\r\n\r\nscala> df.show()\r\n+---------+-------+---------------+\r\n|  Project|   Name|Cost_To_Project|\r\n+---------+-------+---------------+\r\n|Ingestion|  Jerry|           1000|\r\n|Ingestion|   Arya|           2000|\r\n|Ingestion|  Emily|           3000|\r\n|       ML|  Riley|           9000|\r\n|       ML|Patrick|           1000|\r\n|       ML| Mickey|           8000|\r\n|Analytics| Donald|           1000|\r\n|Ingestion|   John|           1000|\r\n|Analytics|  Emily|           8000|\r\n|Analytics|   Arya|          10000|\r\n|       BI| Mickey|          12000|\r\n|       BI| Martin|           
5000|\r\n+---------+-------+---------------+\r\n<\/pre>\n\n\n\n<p>We want to group the dataset by Name and count the rows in each group to see how many projects each employee is assigned to. In addition to that per-employee count, we also want a column holding the total row count of the DataFrame, as shown below.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">+-------+------------------+-----------+\r\n|   Name|number_of_projects|Total Count|\r\n+-------+------------------+-----------+\r\n| Mickey|                 2|         12|\r\n| Martin|                 1|         12|\r\n|  Jerry|                 1|         12|\r\n|  Riley|                 1|         12|\r\n| Donald|                 1|         12|\r\n|   John|                 1|         12|\r\n|Patrick|                 1|         12|\r\n|  Emily|                 2|         12|\r\n|   Arya|                 2|         12|\r\n+-------+------------------+-----------+\r\n<\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Solution<\/h2>\n\n\n\n<p>This is simple to achieve. Count the rows of the original DataFrame with df.count and add the result to the grouped DataFrame as a literal column using lit. Note that df.count is an action, so it runs as a separate Spark job before the grouped query is evaluated.<\/p>\n\n\n\n<p><\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import org.apache.spark.sql.functions.{count, lit}\r\n\r\n\/\/ Count projects per employee, then attach the total row count of df as a literal column\r\nval groupBy = df.groupBy(\"Name\").agg(count(\"*\").alias(\"number_of_projects\")).withColumn(\"Total Count\", lit(df.count))\r\ngroupBy.show()\r\n\r\n+-------+------------------+-----------+\r\n|   Name|number_of_projects|Total Count|\r\n+-------+------------------+-----------+\r\n| Mickey|                 2|         12|\r\n| Martin|                 1|         12|\r\n|  Jerry|                 1|         12|\r\n|  Riley|                 1|         12|\r\n| Donald|                 1|         12|\r\n|   John|                 1|         12|\r\n|Patrick|                 1|         12|\r\n|  Emily|                 2|         12|\r\n|   Arya|                 2|         12|\r\n+-------+------------------+-----------+\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>Here is our data. 
We have an employee DataFrame with 3 columns, name, project and cost_to_project. An employee can belong to multiple projects and for each<span class=\"excerpt-hellip\"> [\u2026]<\/span><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-2284","post","type-post","status-publish","format-standard","hentry","category-spark"],"_links":{"self":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2284","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/comments?post=2284"}],"version-history":[{"count":1,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2284\/revisions"}],"predecessor-version":[{"id":2285,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2284\/revisions\/2285"}],"wp:attachment":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/media?parent=2284"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/categories?post=2284"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/tags?post=2284"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}