<\/p>\n\n\n\n
Here is our data: an employee DataFrame with three columns, Project, Name and Cost_To_Project. An employee can belong to multiple projects, and each assignment carries its own Cost_To_Project value.<\/p>\n\n\n\n
val data = Seq(\r\n  (\"Ingestion\", \"Jerry\", 1000), (\"Ingestion\", \"Arya\", 2000), (\"Ingestion\", \"Emily\", 3000),\r\n  (\"ML\", \"Riley\", 9000), (\"ML\", \"Patrick\", 1000), (\"ML\", \"Mickey\", 8000),\r\n  (\"Analytics\", \"Donald\", 1000), (\"Ingestion\", \"John\", 1000), (\"Analytics\", \"Emily\", 8000),\r\n  (\"Analytics\", \"Arya\", 10000), (\"BI\", \"Mickey\", 12000), (\"BI\", \"Martin\", 5000))\r\n\r\nimport spark.sqlContext.implicits._\r\n\r\nval df = data.toDF(\"Project\", \"Name\", \"Cost_To_Project\")\r\n\r\n\r\nscala> df.show()\r\n+---------+-------+---------------+\r\n|  Project|   Name|Cost_To_Project|\r\n+---------+-------+---------------+\r\n|Ingestion|  Jerry|           1000|\r\n|Ingestion|   Arya|           2000|\r\n|Ingestion|  Emily|           3000|\r\n|       ML|  Riley|           9000|\r\n|       ML|Patrick|           1000|\r\n|       ML| Mickey|           8000|\r\n|Analytics| Donald|           1000|\r\n|Ingestion|   John|           1000|\r\n|Analytics|  Emily|           8000|\r\n|Analytics|   Arya|          10000|\r\n|       BI| Mickey|          12000|\r\n|       BI| Martin|           5000|\r\n+---------+-------+---------------+\r\n<\/pre>\n\n\n\nWe want to group the dataset by Name and count the rows in each group, which gives the number of projects each employee is assigned to. In addition to that per-employee count, we also want a column holding the total row count of the DataFrame, as shown below.<\/p>\n\n\n\n
<\/p>\n\n\n\n
+-------+------------------+-----------+\r\n|   Name|number_of_projects|Total Count|\r\n+-------+------------------+-----------+\r\n| Mickey|                 2|         12|\r\n| Martin|                 1|         12|\r\n|  Jerry|                 1|         12|\r\n|  Riley|                 1|         12|\r\n| Donald|                 1|         12|\r\n|   John|                 1|         12|\r\n|Patrick|                 1|         12|\r\n|  Emily|                 2|         12|\r\n|   Arya|                 2|         12|\r\n+-------+------------------+-----------+\r\n\n<\/pre>\n\n\n\nSolution<\/h2>\n\n\n\n
This is simple to achieve. Count the rows of the original DataFrame with df.count, wrap the resulting number in lit to turn it into a literal column, and add that column to the grouped result with withColumn.<\/p>\n\n\n\n
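As a rough, self-contained sketch of that idea, the steps can be written out one at a time. The local SparkSession, the app name and the intermediate names totalCount and result are assumptions made for illustration; inside spark-shell the spark value already exists and the builder lines can be dropped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{count, lit}

// Assumption: a local session for experimenting; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .appName("total-count-sketch") // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Same rows as the sample data in this post.
val df = Seq(
  ("Ingestion", "Jerry", 1000), ("Ingestion", "Arya", 2000), ("Ingestion", "Emily", 3000),
  ("ML", "Riley", 9000), ("ML", "Patrick", 1000), ("ML", "Mickey", 8000),
  ("Analytics", "Donald", 1000), ("Ingestion", "John", 1000), ("Analytics", "Emily", 8000),
  ("Analytics", "Arya", 10000), ("BI", "Mickey", 12000), ("BI", "Martin", 5000)
).toDF("Project", "Name", "Cost_To_Project")

// count() is an action: it runs a job right away and returns a plain Long to the driver.
val totalCount: Long = df.count()

// Group by employee, count the rows per group, then attach the total as a constant column.
val result = df
  .groupBy("Name")
  .agg(count("*").alias("number_of_projects"))
  .withColumn("Total Count", lit(totalCount))

result.show()
```

Since the total is computed by a separate action before the aggregation runs, df is evaluated twice; if df is expensive to build, caching it first may be worth considering.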
<\/p>\n\n\n\n
import org.apache.spark.sql.functions.{count, lit}\r\n\r\nval groupBy = df.groupBy(\"Name\").agg(count(\"*\").alias(\"number_of_projects\")).withColumn(\"Total Count\", lit(df.count))\r\ngroupBy.show()\r\n\r\n+-------+------------------+-----------+\r\n|   Name|number_of_projects|Total Count|\r\n+-------+------------------+-----------+\r\n| Mickey|                 2|         12|\r\n| Martin|                 1|         12|\r\n|  Jerry|                 1|         12|\r\n|  Riley|                 1|         12|\r\n| Donald|                 1|         12|\r\n|   John|                 1|         12|\r\n|Patrick|                 1|         12|\r\n|  Emily|                 2|         12|\r\n|   Arya|                 2|         12|\r\n+-------+------------------+-----------+\r\n<\/pre>\n","protected":false},"excerpt":{"rendered":"Here is our data. We have an employee DataFrame with 3 columns, name, project and cost_to_project. An employee can belong to multiple projects and for each [\u2026]<\/span><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[13],"tags":[],"class_list":["post-2284","post","type-post","status-publish","format-standard","hentry","category-spark"],"yoast_head":"\n
How to add total count of DataFrame to an already grouped DataFrame? - Big Data In Real World<\/title>\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\n\t\n\t\n