How to solve word count problem in Hive?

What is the difference between groupByKey and reduceByKey in Spark?

April 7, 2021

How to see the first few lines from a file in S3 using AWS CLI?

April 12, 2021

Published by Big Data In Real World at April 9, 2021

Tags

If you have read about MapReduce you know what a word count problem is. Word count is simply counting the number of words in a dataset. You probably know how this problem is solved with MapReduce.

In this post we are going to see how to solve the word count problem in Hive.

We have a file with the following content.

When different join strategy hints are specified on both sides of a join, Spark prioritizes the BROADCAST hint over the MERGE hint over the SHUFFLE_HASH hint over the SHUFFLE_REPLICATE_NL hint

Our solution should look like below which is basically the number of occurrence of each word in the file.

+-----------------------+------+--+
|         words         | _c1  |
+-----------------------+------+--+
| BROADCAST             | 1    |
| MERGE                 | 1    |
| SHUFFLE_HASH          | 1    |
| SHUFFLE_REPLICATE_NL  | 1    |
| Spark                 | 1    |
| When                  | 1    |
| a                     | 1    |
| are                   | 1    |
| both                  | 1    |
| different             | 1    |
| hint                  | 4    |
| hints                 | 1    |
| join                  | 1    |
| join,                 | 1    |
| of                    | 1    |
| on                    | 1    |
| over                  | 3    |
| prioritizes           | 1    |
| sides                 | 1    |
| specified             | 1    |
| strategy              | 1    |
| the                   | 4    |
+-----------------------+------+--+

Solution

We will be using split(), explode() and lateral view to solve this problem.

split()

Step 1 – we will split the contents of the file by space. Split will turn each line in the file to an array of words

explode()

Step 2 – we will apply the explode() function on the array of words. explode() is a user-defined table generating function which takes in a row and explode to multiple rows.

In this case, explode will take the array of words and explode each word into a row. If the array has 5 words, we will end up with 5 rows.

LATERAL VIEW

Lateral view is used in conjunction with user-defined table generating functions such as explode() .

A lateral view first applies the UDTF to each row of the base table and then joins resulting output rows to the input rows to form a virtual table having the supplied table alias.

LATERAL VIEW can’t function alone. It needs to be used along with a UDTF. Here we are using explode() to first explode the array to individual rows or words. For the exploded data we are naming the table as expl_words with a column words.

SELECT words, count(1)
FROM textfile
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words
GROUP BY words;

LATERAL VIEW joins resulting output exploded rows to the input rows from textfile. In this case, we are not displaying the line column from textfile because we are not interested in that column.

SELECT words, count(1)
FROM textfile
LATERAL VIEW EXPLODE(SPLIT(line, ' ')) expl_words AS words
GROUP BY words;
INFO  : Session is already open
INFO  : Dag name: SELECT words, count(1)
FROM textfile...words(Stage-1)
INFO  : Tez session was closed. Reopening...
INFO  : Session re-established.
INFO  : Status: Running (Executing on YARN cluster with App id application_1604763385917_0004)
--------------------------------------------------------------------------------

       VERTICES      STATUS  TOTAL  COMPLETED  RUNNING  PENDING  FAILED  KILLED

--------------------------------------------------------------------------------
Map 1 ..........   SUCCEEDED      1          1        0        0       0       0
Reducer 2 ......   SUCCEEDED      1          1        0        0       0       0
--------------------------------------------------------------------------------
VERTICES: 02/02  [==========================>>] 100%  ELAPSED TIME: 24.89 s
--------------------------------------------------------------------------------
+-----------------------+------+--+
|         words         | _c1  |
+-----------------------+------+--+
| BROADCAST             | 1    |
| MERGE                 | 1    |
| SHUFFLE_HASH          | 1    |
| SHUFFLE_REPLICATE_NL  | 1    |
| Spark                 | 1    |
| When                  | 1    |
| a                     | 1    |
| are                   | 1    |
| both                  | 1    |
| different             | 1    |
| hint                  | 4    |
| hints                 | 1    |
| join                  | 1    |
| join,                 | 1    |
| of                    | 1    |
| on                    | 1    |
| over                  | 3    |
| prioritizes           | 1    |
| sides                 | 1    |
| specified             | 1    |
| strategy              | 1    |
| the                   | 4    |
+-----------------------+------+--+
22 rows selected (27.572 seconds)

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

Comments are closed.

How to solve word count problem in Hive?

What is the difference between groupByKey and reduceByKey in Spark?

How to see the first few lines from a file in S3 using AWS CLI?

What is the difference between groupByKey and reduceByKey in Spark?

How to see the first few lines from a file in S3 using AWS CLI?

Solution

split()

explode()

LATERAL VIEW

Big Data In Real World

Related posts

How to transpose or convert columns to rows in Hive?

How to fail a Hive script based on a condition?

How to delete duplicate data from the Hive table?