Published by Big Data In Real World at December 19, 2015

Apache Pig Tutorial – Grouping Records

The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.

We have been learning a lot of concepts in Apache Pig (look at the previous posts). In this post we will see how to group records using Apache Pig.

With all the concepts we have seen so far, why don’t we solve a meaningful problem? Our stocks dataset has day-by-day stock information, such as the opening and closing prices, high, low and volume for the day, for several symbols across many years. Let’s calculate the average daily volume for each stock symbol in the year 2003.

Load & Filter Records from 2003

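See the previous posts to learn more about loading and projecting datasets. The statements below load the stocks dataset with a schema and then keep only the records whose date falls in 2003.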

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
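
Before grouping, it can help to confirm that the filter kept what we expect. A minimal sketch, assuming the two statements above have already been run; first_rows is an illustrative alias that is not used elsewhere in this series, and the row count of 5 is arbitrary.

```pig
-- Confirm the schema survived the filter unchanged.
DESCRIBE filter_by_yr;

-- Look at a handful of filtered rows; every date should fall in 2003.
-- first_rows is a hypothetical alias introduced only for this check.
first_rows = LIMIT filter_by_yr 5;
DUMP first_rows;
```

Each statement can also be typed at the grunt> prompt one at a time.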

Grouping Records

Now that the dataset is loaded and filtered, let’s group the records by symbol. Grouping records is simple: use the GROUP…BY operator as follows.

grunt> grp_by_sym = GROUP filter_by_yr BY symbol;

Finding Average

Now that the records are grouped by symbol, we can do more meaningful operations on them. To project grp_by_sym we need to know its structure, so use the DESCRIBE operator on grp_by_sym.

grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}

From the DESCRIBE output we can see that grp_by_sym has two columns, group and filter_by_yr. group holds the value of the column used for grouping; since we grouped the records by the symbol column, group holds the symbol. filter_by_yr is a bag containing all the records for that symbol.
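
To see how these two columns are addressed in practice, the sketch below counts the records in each symbol's bag. It assumes grp_by_sym exists as defined above; sym_counts and num_records are illustrative names, not part of the original walkthrough.

```pig
-- group holds the symbol; filter_by_yr is the bag of 2003 records for it.
-- sym_counts and num_records are hypothetical names used only here.
sym_counts = FOREACH grp_by_sym GENERATE group AS symbol, COUNT(filter_by_yr) AS num_records;
DUMP sym_counts;
```

Renaming group with AS symbol is optional, but it makes later statements easier to read.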

Our goal is to project the symbol and the average volume for that symbol. The FOREACH projection below achieves that with the AVG function; ROUND then rounds the average to the nearest whole number.

grunt> avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;
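
The same pattern extends to other built-in aggregates. As a sketch under the same schema, the statement below also computes the highest high and the lowest low per symbol for the year; stats_2003, year_high and year_low are illustrative names.

```pig
-- Several aggregates over the same bag in one projection.
-- stats_2003, year_high and year_low are hypothetical names.
stats_2003 = FOREACH grp_by_sym GENERATE
    group AS symbol,
    ROUND(AVG(filter_by_yr.volume)) AS avgvolume,
    MAX(filter_by_yr.high) AS year_high,
    MIN(filter_by_yr.low) AS year_low;
```

Each aggregate takes a single column projected out of the bag, such as filter_by_yr.high, rather than the bag itself.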

Display Result

grunt> DUMP avg_volume;
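
DUMP prints the result to the console, which is convenient for small outputs. To keep the result, one option is STORE. In the sketch below the output path is hypothetical, and the directory must not already exist when the statement runs.

```pig
-- Hypothetical output directory; Pig fails if it already exists.
STORE avg_volume INTO '/user/hirw/output/avg_volume_2003' USING PigStorage(',');
```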

Previous Lesson : Filter Records

Next Lesson : Ordering Records

