Apache Pig Tutorial - Grouping Records - Big Data In Real World

The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.

We have been learning a lot of concepts in Apache Pig (look at the previous posts). In this post we will see how to group records using Apache Pig.

With all the concepts we have seen so far, why don’t we solve a meaningful problem? Our stocks dataset has day-by-day stock information (opening and closing prices, high, low and volume for the day) for several symbols across many years. Let’s calculate the average daily volume for each stock symbol for the year 2003.

Load & Filter Records from 2003

See the previous posts to learn more about loading and projecting datasets.

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
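For readers who think more easily in a general-purpose language, here is a minimal Python sketch of what the LOAD and FILTER steps above do conceptually. It is not how Pig executes them. The sample rows are made up for illustration, and the yyyy-MM-dd date format is an assumption, since the post does not show the raw file.

```python
import csv
import io
from datetime import datetime

# Made-up sample rows in the same column order as the LOAD schema above.
# The values are illustrative only, not taken from the real stocks dataset.
SAMPLE = """\
NYSE,AAA,2003-01-02,10.0,11.0,9.5,10.5,1000,10.5
NYSE,AAA,2004-01-02,12.0,12.5,11.0,12.2,3000,12.2
NYSE,BBB,2003-06-30,20.0,21.0,19.0,20.5,500,20.5
"""

FIELDS = ["exchange", "symbol", "date", "open", "high", "low",
          "close", "volume", "adj_close"]


def load_stocks(text):
    """Rough analogue of LOAD ... USING PigStorage(','): one dict per line."""
    rows = []
    for raw in csv.reader(io.StringIO(text)):
        row = dict(zip(FIELDS, raw))
        # Assumed date format; adjust if the real file differs.
        row["date"] = datetime.strptime(row["date"], "%Y-%m-%d")
        row["volume"] = int(row["volume"])
        for name in ("open", "high", "low", "close", "adj_close"):
            row[name] = float(row[name])
        rows.append(row)
    return rows


def filter_by_year(rows, year):
    """Rough analogue of FILTER stocks BY GetYear(date) == year."""
    return [r for r in rows if r["date"].year == year]


stocks = load_stocks(SAMPLE)
filter_by_yr = filter_by_year(stocks, 2003)
print([(r["symbol"], r["date"].year) for r in filter_by_yr])
# prints [('AAA', 2003), ('BBB', 2003)]
```

The 2004 row is dropped, which is what the FILTER statement does to every record outside 2003.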

Grouping Records

Now that the dataset is loaded and filtered, let’s group the records by symbol. Grouping records is very simple: use the GROUP…BY operator as follows.

grunt> grp_by_sym = GROUP filter_by_yr BY symbol;

Finding Average

Now that the records are grouped by symbol, we can do more meaningful operations on the grouped records. In order to project grp_by_sym we need to know the structure of grp_by_sym. Use the DESCRIBE operator on grp_by_sym:

grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}

From the DESCRIBE output we can see that grp_by_sym has two columns, group and filter_by_yr. group represents the column used for grouping; since we grouped the records by the symbol column, group holds the symbol. filter_by_yr holds the bag of records for a given symbol.
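The shape of the grouped relation can be easier to picture with a small model. The Python sketch below mimics the (group, filter_by_yr) pairs that DESCRIBE reports, using made-up records trimmed to two fields. The helper name group_by is hypothetical and is not part of Pig.

```python
from collections import defaultdict

# Made-up records standing in for the filtered relation; only the fields
# needed to show the grouping are kept.
filter_by_yr = [
    {"symbol": "AAA", "volume": 1000},
    {"symbol": "BBB", "volume": 500},
    {"symbol": "AAA", "volume": 2001},
]


def group_by(records, key):
    """Rough analogue of GROUP records BY key.

    Returns one (group, bag) pair per distinct key value, where the bag is
    the list of full records that share that value.
    """
    bags = defaultdict(list)
    for rec in records:
        bags[rec[key]].append(rec)
    return list(bags.items())


grp_by_sym = group_by(filter_by_yr, "symbol")
for group, bag in grp_by_sym:
    print(group, [rec["volume"] for rec in bag])
# prints:
# AAA [1000, 2001]
# BBB [500]
```

Note that each bag still contains whole records, including the symbol field itself. That is why the Pig schema shows symbol both as group and again inside filter_by_yr.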

Our goal is to project the symbol and the average volume for that symbol. The FOREACH projection below does that using the AVG function, with ROUND applied to round the average to the nearest whole number.

grunt> avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;
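As a rough illustration of the arithmetic in that FOREACH, the Python sketch below computes a rounded average per group over made-up data. Two assumptions are worth flagging: that Pig's AVG skips null values, and that Pig's ROUND rounds halves up in the manner of Java's Math.round. Both match my understanding of the Pig built-ins, but check the documentation for your Pig version.

```python
import math

# Made-up grouped data in the (group, bag) shape produced by GROUP ... BY.
grp_by_sym = [
    ("AAA", [{"volume": 1000}, {"volume": 2001}]),
    ("BBB", [{"volume": 500}]),
]


def avg(values):
    """Mean of the non-null values, or None if there are none (assumed AVG behaviour)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None


def round_half_up(x):
    """Round to the nearest integer with halves going up (assumed ROUND behaviour).

    Python's built-in round() uses banker's rounding, so round(1500.5) is 1500;
    this helper returns 1501 instead.
    """
    return math.floor(x + 0.5)


avg_volume = [
    (group, round_half_up(avg([rec["volume"] for rec in bag])))
    for group, bag in grp_by_sym
]
print(avg_volume)
# prints [('AAA', 1501), ('BBB', 500)]
```

The expression filter_by_yr.volume in the Pig statement plays the same role as the inner list comprehension here: it pulls just the volume field out of every record in the bag before the aggregate is applied.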

Display Result

grunt> DUMP avg_volume;

Previous Lesson : Filter Records

Next Lesson : Ordering Records

Big Data In Real World