What is the difference between groupByKey and reduceByKey in Spark?
April 7, 2021How to see the first few lines from a file in S3 using AWS CLI?
April 12, 2021If you have read about MapReduce you know what a word count problem is. Word count is simply counting the number of words in a dataset. You probably know how this problem is solved with MapReduce.
In this post we are going to see how to solve the word count problem in Hive.
We have a file with the following content.
When different join strategy hints are specified on both sides of a join, Spark prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint over the SHUFFLE_REPLICATE_NL hint
Our solution should look like below which is basically the number of occurrence of each word in the file.
+-----------------------+------+--+ | words | _c1 | +-----------------------+------+--+ | BROADCAST | 1 | | MERGE | 1 | | SHUFFLE_HASH | 1 | | SHUFFLE_REPLICATE_NL | 1 | | Spark | 1 | | When | 1 | | a | 1 | | are | 1 | | both | 1 | | different | 1 | | hint | 4 | | hints | 1 | | join | 1 | | join, | 1 | | of | 1 | | on | 1 | | over | 3 | | prioritizes | 1 | | sides | 1 | | specified | 1 | | strategy | 1 | | the | 4 | +-----------------------+------+--+
Solution
We will be using split(), explode() and lateral view to solve this problem.
split()
Step 1 – we will split the contents of the file by space. Split will turn each line in the file to an array of words
explode()
Step 2 – we will apply the explode() function on the array of words. explode() is a user-defined table generating function which takes in a row and explode to multiple rows.
In this case, explode will take the array of words and explode each word into a row. If the array has 5 words, we will end up with 5 rows.
LATERAL VIEW
Lateral view is used in conjunction with user-defined table generating functions such as explode() .
A lateral view first applies the UDTF to each row of the base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.
LATERAL VIEW can’t function alone. It needs to be used along with a UDTF. Here we are using explode() to first explode the array to individual rows or words. For the exploded data we are naming the table as expl_words with a column words.
SELECT words, count(1) FROM textfile LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words GROUP BY words;
LATERAL VIEW joins resulting output exploded rows to the input rows from textfile. In this case, we are not displaying the line column from textfile because we are not interested in that column.
SELECT words, count(1) FROM textfile LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words GROUP BY words; INFO : Session is already open INFO : Dag name: SELECT words, count(1) FROM textfile...words(Stage-1) INFO : Tez session was closed. Reopening... INFO : Session re-established. INFO : Status: Running (Executing on YARN cluster with App id application_1604763385917_0004) -------------------------------------------------------------------------------- VERTICES STATUS TOTAL COMPLETED RUNNING PENDING FAILED KILLED -------------------------------------------------------------------------------- Map 1 .......... SUCCEEDED 1 1 0 0 0 0 Reducer 2 ...... SUCCEEDED 1 1 0 0 0 0 -------------------------------------------------------------------------------- VERTICES: 02/02 [==========================>>] 100% ELAPSED TIME: 24.89 s -------------------------------------------------------------------------------- +-----------------------+------+--+ | words | _c1 | +-----------------------+------+--+ | BROADCAST | 1 | | MERGE | 1 | | SHUFFLE_HASH | 1 | | SHUFFLE_REPLICATE_NL | 1 | | Spark | 1 | | When | 1 | | a | 1 | | are | 1 | | both | 1 | | different | 1 | | hint | 4 | | hints | 1 | | join | 1 | | join, | 1 | | of | 1 | | on | 1 | | over | 3 | | prioritizes | 1 | | sides | 1 | | specified | 1 | | strategy | 1 | | the | 4 | +-----------------------+------+--+ 22 rows selected (27.572 seconds)