{"id":2005,"date":"2021-04-09T06:00:00","date_gmt":"2021-04-09T11:00:00","guid":{"rendered":"https:\/\/www.bigdatainrealworld.com\/?p=2005"},"modified":"2023-02-19T07:31:30","modified_gmt":"2023-02-19T13:31:30","slug":"how-to-solve-word-count-problem-in-hive","status":"publish","type":"post","link":"https:\/\/www.bigdatainrealworld.com\/how-to-solve-word-count-problem-in-hive\/","title":{"rendered":"How to solve word count problem in Hive?"},"content":{"rendered":"<p>If you have read about MapReduce, you know what the word count problem is: counting the number of times each word occurs in a dataset. You probably know how this problem is solved with MapReduce.<\/p>\n<p><span style=\"font-weight: 400;\">In this post we are going to see how to solve the word count problem in Hive.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We have a file with the following content.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">When different join strategy hints are specified on both sides of a join, Spark prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint over the SHUFFLE_REPLICATE_NL hint<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Our output should look like the table below, which shows the number of occurrences of each word in the file.<\/span><\/p>\n<pre class=\"lang:default decode:true \">+-----------------------+------+--+\n| &nbsp; &nbsp; &nbsp; &nbsp; words &nbsp; &nbsp; &nbsp; &nbsp; | _c1&nbsp; |\n+-----------------------+------+--+\n| BROADCAST &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| MERGE &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| SHUFFLE_HASH&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| SHUFFLE_REPLICATE_NL&nbsp; | 1&nbsp; &nbsp; |\n| Spark &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| When&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| a &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 
&nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| are &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| both&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| different &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| hint&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 4&nbsp; &nbsp; |\n| hints &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| join&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| join, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| of&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| on&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| over&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 3&nbsp; &nbsp; |\n| prioritizes &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| sides &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| specified &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| strategy&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| the &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 4&nbsp; &nbsp; |\n+-----------------------+------+--+<\/pre>\n<h2><span style=\"font-weight: 400;\">Solution<\/span><\/h2>\n<p><span style=\"font-weight: 400;\">We will use split(), explode() and LATERAL VIEW to solve this problem.<\/span><\/p>\n<h3><span class=\"lang:default decode:true crayon-inline \">split()<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Step 1 &#8211; we split each line of the file on the space character. split() turns each line into an array of words. Because we split only on spaces, punctuation stays attached to its word, which is why join and join, appear as separate entries in the output.<\/span><\/p>\n<h3><span class=\"lang:default decode:true crayon-inline\">explode()<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">Step 2 &#8211; we apply the explode() function to the array of words. 
explode() is one of Hive's built-in<\/span><span style=\"font-weight: 400;\"> table-generating functions (UDTFs): it takes a single input row and produces multiple output rows.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In this case, explode() takes the array of words and emits each word as its own row. If the array has 5 words, we end up with 5 rows.<\/span><\/p>\n<h3><span style=\"font-weight: 400;\">LATERAL VIEW<\/span><\/h3>\n<p><span style=\"font-weight: 400;\">LATERAL VIEW is used in conjunction with table-generating functions (UDTFs) such as <span class=\"lang:default decode:true crayon-inline \">explode()<\/span><\/span><span style=\"font-weight: 400;\">.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">LATERAL VIEW can\u2019t be used on its own; it must be paired with a UDTF. Here we use <span class=\"lang:default decode:true crayon-inline \">explode()<\/span> to turn the array into individual rows, one word per row. We name the resulting virtual table expl_words and its single column words.<\/span><\/p>\n<pre class=\"lang:default decode:true \">SELECT words, count(1)\nFROM textfile\nLATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words\nGROUP BY words;<\/pre>\n<p><span style=\"font-weight: 400;\">LATERAL VIEW joins the exploded output rows to the input rows from textfile. 
In this case, we are not displaying the line column from textfile because we are not interested in that column.<\/span><\/p>\n<pre class=\"lang:default decode:true \">SELECT words, count(1)\nFROM textfile\nLATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words\nGROUP BY words;\nINFO&nbsp; : Session is already open\nINFO&nbsp; : Dag name: SELECT words, count(1)\nFROM textfile...words(Stage-1)\nINFO&nbsp; : Tez session was closed. Reopening...\nINFO&nbsp; : Session re-established.\nINFO&nbsp; : Status: Running (Executing on YARN cluster with App id application_1604763385917_0004)\n--------------------------------------------------------------------------------\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;VERTICES&nbsp; &nbsp; &nbsp; STATUS&nbsp; TOTAL&nbsp; COMPLETED&nbsp; RUNNING&nbsp; PENDING&nbsp; FAILED&nbsp; KILLED\n\n--------------------------------------------------------------------------------\nMap 1 .......... &nbsp; SUCCEEDED&nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; 0&nbsp; &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp; 0\nReducer 2 ...... 
&nbsp; SUCCEEDED&nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; 1&nbsp; &nbsp; &nbsp; &nbsp; 0&nbsp; &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp; 0 &nbsp; &nbsp; &nbsp; 0\n--------------------------------------------------------------------------------\nVERTICES: 02\/02&nbsp; [==========================&gt;&gt;] 100%&nbsp; ELAPSED TIME: 24.89 s\n--------------------------------------------------------------------------------\n+-----------------------+------+--+\n| &nbsp; &nbsp; &nbsp; &nbsp; words &nbsp; &nbsp; &nbsp; &nbsp; | _c1&nbsp; |\n+-----------------------+------+--+\n| BROADCAST &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| MERGE &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| SHUFFLE_HASH&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| SHUFFLE_REPLICATE_NL&nbsp; | 1&nbsp; &nbsp; |\n| Spark &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| When&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| a &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| are &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| both&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| different &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| hint&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 4&nbsp; &nbsp; |\n| hints &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| join&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| join, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| of&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| on&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| over&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 3&nbsp; 
&nbsp; |\n| prioritizes &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| sides &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| specified &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| strategy&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 1&nbsp; &nbsp; |\n| the &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; | 4&nbsp; &nbsp; |\n+-----------------------+------+--+\n22 rows selected (27.572 seconds)\n\n\n<\/pre>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>If you have read about MapReduce you know what a word count problem is. Word count is simply counting the number of words in a dataset.<span class=\"excerpt-hellip\"> [\u2026]<\/span><\/p>\n","protected":false},"author":3,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[12],"tags":[],"class_list":["post-2005","post","type-post","status-publish","format-standard","hentry","category-apache-hive"],"_links":{"self":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2005","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/comments?post=2005"}],"version-history":[{"count":3,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2005\/revisions"}],"predecessor-version":[{"id":2044,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/posts\/2005\/revisions\/2044"}],"wp:attachment":[{"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/media?parent=2005"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/ww
w.bigdatainrealworld.com\/wp-json\/wp\/v2\/categories?post=2005"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.bigdatainrealworld.com\/wp-json\/wp\/v2\/tags?post=2005"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}