Apache Pig Tutorial – Grouping Records
The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet, and most will have a (very short) “see it in action” video.
We have been learning a lot of concepts in Apache Pig (look at the previous posts). In this post we will see how to group records using Apache Pig.
With all the concepts we have seen so far, why don’t we solve a meaningful problem? Our stocks dataset has day-by-day stock information such as opening and closing prices, high, low and volume for several symbols across many years. Let’s calculate the average daily volume for each stock symbol for the year 2003.
Load & Filter Records from 2003
Look up the previous posts to learn more about loading and projecting datasets.
grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
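Before grouping, it can help to confirm that the schema and the filter behave as expected. Here is a minimal sketch, assuming the two statements above have already been run in the same grunt session. The alias first_rows and the sample size of 5 are arbitrary choices for illustration, not part of the original walkthrough.

```pig
-- Print the schema of the filtered relation
DESCRIBE filter_by_yr;

-- Look at a handful of records; first_rows is a hypothetical alias
first_rows = LIMIT filter_by_yr 5;
DUMP first_rows;
```

If every date in the dumped records falls in 2003, the filter is doing its job.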
Grouping Records
Now that the dataset is loaded and filtered, let’s group the records by symbol. Grouping records is very simple: use the GROUP…BY operator as follows.
grunt> grp_by_sym = GROUP filter_by_yr BY symbol;
Finding Average
Now that the records are grouped by symbol, we can do more meaningful operations on the grouped records. In order to project grp_by_sym we need to know its structure, so use the DESCRIBE operator on grp_by_sym.
grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
From the describe operation we can see that grp_by_sym has two columns, group and filter_by_yr. group represents the column used for grouping; in our case we grouped the records by the symbol column, so group holds the symbol. filter_by_yr is a bag that holds all the records for a given symbol.
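One way to get a feel for this structure is to count how many records landed in each symbol's bag. This is a small sketch that assumes grp_by_sym from the GROUP step above. The alias sizes and the field names symbol and num_records are made up for illustration.

```pig
-- For each group, emit the grouping key and the number of tuples in its bag
sizes = FOREACH grp_by_sym GENERATE group AS symbol, COUNT(filter_by_yr) AS num_records;

-- Check the resulting schema, then print the counts
DESCRIBE sizes;
DUMP sizes;
```

Each output tuple pairs a symbol with the number of 2003 records collected for it, which makes it clear that the bag is simply the set of original rows sharing that key.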
Our goal is to project the symbol and the average volume for that symbol. The FOREACH projection below achieves that with the AVG function.
grunt> avg_volume = FOREACH grp_by_sym GENERATE group, ROUND(AVG(filter_by_yr.volume)) as avgvolume;
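In the statement above, filter_by_yr.volume picks the volume field out of every record in the bag, and AVG then runs over that bag of volumes. The same pattern works with other built-in aggregates. The sketch below assumes grp_by_sym from the earlier step. The alias vol_stats and its field names are hypothetical.

```pig
-- Several aggregates over the same bag, one output tuple per symbol
vol_stats = FOREACH grp_by_sym GENERATE
    group AS symbol,
    MIN(filter_by_yr.volume) AS min_volume,
    MAX(filter_by_yr.volume) AS max_volume,
    ROUND(AVG(filter_by_yr.volume)) AS avg_volume;

DESCRIBE vol_stats;
```

Naming each output field with AS, including group, makes the result easier to reference in later statements.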
Display Result
grunt> DUMP avg_volume;
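DUMP is handy for a quick look, but for anything larger you would probably want the result written out. The steps in this post can be put together as a script along the following lines. The script name and the output directory are assumptions for illustration, and the output directory should not already exist when the job runs.

```pig
-- avg_volume_2003.pig (hypothetical file name)

-- Load the stocks dataset with the same schema used in this post
stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',')
         AS (exchange:chararray, symbol:chararray, date:datetime,
             open:float, high:float, low:float, close:float,
             volume:int, adj_close:float);

-- Keep only the records from 2003
filter_by_yr = FILTER stocks BY GetYear(date) == 2003;

-- Collect the records for each symbol into a bag
grp_by_sym = GROUP filter_by_yr BY symbol;

-- Average volume per symbol, rounded to a whole number
avg_volume = FOREACH grp_by_sym GENERATE group AS symbol,
             ROUND(AVG(filter_by_yr.volume)) AS avgvolume;

-- Write the result as comma-separated text; this output path is an assumption
STORE avg_volume INTO '/user/hirw/output/avg_volume_2003' USING PigStorage(',');
```

Saved to a file, the script can be run from the command line with pig avg_volume_2003.pig instead of typing each statement at the grunt> prompt.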