If you have read about MapReduce, you know what a word count problem is. Word count means counting how many times each word occurs in a dataset. You probably know how this problem is solved with MapReduce.

In this post we are going to see how to solve the word count problem in Hive.

We have a file with the following content.

When different join strategy hints are specified on both sides of a join, Spark prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint over the SHUFFLE_REPLICATE_NL hint

Our solution should look like the output below, which is the number of occurrences of each word in the file.

```
+-----------------------+------+--+
| words                 | _c1  |
+-----------------------+------+--+
| BROADCAST             | 1    |
| MERGE                 | 1    |
| SHUFFLE_HASH          | 1    |
| SHUFFLE_REPLICATE_NL  | 1    |
| Spark                 | 1    |
| When                  | 1    |
| a                     | 1    |
| are                   | 1    |
| both                  | 1    |
| different             | 1    |
| hint                  | 4    |
| hints                 | 1    |
| join                  | 1    |
| join,                 | 1    |
| of                    | 1    |
| on                    | 1    |
| over                  | 3    |
| prioritizes           | 1    |
| sides                 | 1    |
| specified             | 1    |
| strategy              | 1    |
| the                   | 4    |
+-----------------------+------+--+
```

Solution

We will use split(), explode() and LATERAL VIEW to solve this problem.

split()

Step 1: split the contents of the file by space. split() turns each line in the file into an array of words.

explode()

Step 2: apply the explode() function to the array of words. explode() is a user-defined table generating function (UDTF) that takes in a single row and expands it into multiple rows.

In this case, explode() takes the array of words and emits each word as its own row. If the array has 5 words, we end up with 5 rows.

LATERAL VIEW

LATERAL VIEW is used in conjunction with user-defined table generating functions such as explode().

A lateral view first applies the UDTF to each row of the base table and then joins the resulting output rows to the input rows to form a virtual table with the supplied table alias.

LATERAL VIEW can't function alone; it must be used along with a UDTF. Here we use explode() to turn the array into individual rows, one per word. We name the exploded table expl_words, with a single column named words.

LATERAL VIEW joins the exploded output rows to the input rows from textfile. In this case we do not select the line column from textfile because we are not interested in that column.

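To see what each building block contributes, the pieces can be run one at a time. The following is a minimal sketch that assumes the `textfile` table with a single string column `line` used by the query in this post; the aliases `word_array` and `word` are illustrative names, not taken from the original.

```sql
-- Step 1: split() turns each line into an array of words (array<string>).
SELECT split(line, ' ') AS word_array
FROM textfile;

-- Step 2: explode() emits one row per array element.
-- When a UDTF is used directly in the SELECT list, it has to be
-- the only expression there, so `line` cannot be selected alongside it.
SELECT explode(split(line, ' ')) AS word
FROM textfile;

-- LATERAL VIEW joins each exploded row back to its source row,
-- so the original `line` column can be selected next to each word.
SELECT line, words
FROM textfile
LATERAL VIEW explode(split(line, ' ')) expl_words AS words;
```

The third query is the reason LATERAL VIEW is needed at all: it makes the exploded column available like a regular column, which is what allows grouping on it.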
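```sql
-- Split each line into words, explode the array into one row per word,
-- then count how many times each word occurs.
SELECT words, count(1)
FROM textfile
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words
GROUP BY words;
```
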
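The query expects a table named `textfile` with a column named `line`, but the post does not show how that table was created. One possible setup is sketched below; the file path is a placeholder and the storage format is an assumption, so adjust both to your environment.

```sql
-- Hypothetical setup: a single STRING column, one row per line of the input file.
-- With the default field delimiter, the whole line lands in `line`.
CREATE TABLE IF NOT EXISTS textfile (line STRING)
STORED AS TEXTFILE;

-- The path below is a placeholder, not taken from the original post.
LOAD DATA LOCAL INPATH '/tmp/wordcount_input.txt'
OVERWRITE INTO TABLE textfile;
```

When connecting through HiveServer2, LOCAL generally refers to the file system of the server host rather than the client machine.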
```
SELECT words, count(1)
FROM textfile
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words
GROUP BY words;
INFO : Session is already open
INFO : Dag name: SELECT words, count(1)
FROM textfile...words(Stage-1)
INFO : Tez session was closed. Reopening...
INFO : Session re-established.
INFO : Status: Running (Executing on YARN cluster with App id application_1604763385917_0004)
--------------------------------------------------------------------------------

        VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 24.89 s
--------------------------------------------------------------------------------
+-----------------------+------+--+
| words                 | _c1  |
+-----------------------+------+--+
| BROADCAST             | 1    |
| MERGE                 | 1    |
| SHUFFLE_HASH          | 1    |
| SHUFFLE_REPLICATE_NL  | 1    |
| Spark                 | 1    |
| When                  | 1    |
| a                     | 1    |
| are                   | 1    |
| both                  | 1    |
| different             | 1    |
| hint                  | 4    |
| hints                 | 1    |
| join                  | 1    |
| join,                 | 1    |
| of                    | 1    |
| on                    | 1    |
| over                  | 3    |
| prioritizes           | 1    |
| sides                 | 1    |
| specified             | 1    |
| strategy              | 1    |
| the                   | 4    |
+-----------------------+------+--+
22 rows selected (27.572 seconds)
```
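Two details in this output are worth noting: `join,` is counted separately from `join` because splitting on a space keeps punctuation attached to the word, and the count column is labelled `_c1` because it has no alias. The variant below is a sketch of one way to address both, not part of the original solution; the alias `word_count` and the choice of which characters to strip are assumptions.

```sql
-- Variant: strip punctuation before splitting, name the count column,
-- and sort the most frequent words first.
SELECT words, count(1) AS word_count
FROM textfile
LATERAL VIEW explode(
    -- Keep letters, digits, underscores and spaces; drop everything else.
    split(regexp_replace(line, '[^A-Za-z0-9_ ]', ''), ' ')
) expl_words AS words
WHERE words <> ''   -- skip empty tokens produced by consecutive spaces
GROUP BY words
ORDER BY word_count DESC, words;
```

With the sample line, this should fold `join,` into `join`. Matching is still case-sensitive; wrapping `line` in lower() would merge words that differ only in case, if that is the behavior you want.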