December 19, 2015
Apache Pig Tutorial – Filter Records
The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet, and most will have a (very short) “see it in action” video.
In our previous posts we saw different variations of loading a dataset in Pig and also saw how to project and manipulate columns. In this post we will see how to filter records with Apache Pig.
Use the FILTER operator to select the records in a dataset that satisfy a condition. It is very similar to the WHERE clause in SQL.
Load the stocks dataset.
grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
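Before filtering, it can help to confirm that the schema was applied as intended, since the filters in this post depend on volume being an int and date being a datetime. A minimal sketch, assuming the stocks relation was loaded as shown above; the statement can be entered at the grunt> prompt like the other examples:

```pig
-- Print the field names and types of the stocks relation.
DESCRIBE stocks;
```
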
FILTER Operator
Filter stock records by volume. The resulting filter_by_volume relation contains only the records with a volume greater than 400,000.
grunt> filter_by_volume = FILTER stocks by volume > 400000;
Filter stock records from the year 2003. We use the built-in GetYear function to extract the year from the date column, so the resulting filter_by_yr relation contains only records from 2003.
grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
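Just as a SQL WHERE clause can combine several conditions, a FILTER expression can join conditions with AND, OR and NOT. The sketch below reuses the two conditions from this post; the relation names filter_by_both and filter_by_either are made up for illustration, and each statement can be entered at the grunt> prompt:

```pig
-- Records from 2003 that also have a volume above 400,000.
filter_by_both = FILTER stocks BY (volume > 400000) AND (GetYear(date) == 2003);

-- Records that satisfy at least one of the two conditions.
filter_by_either = FILTER stocks BY (volume > 400000) OR (GetYear(date) == 2003);
```

The parentheses are optional here, but they keep the intent obvious once a condition grows beyond two terms.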
Display Results
Display both filter_by_volume and filter_by_yr, limiting each to 100 records.
grunt> top100 = LIMIT filter_by_volume 100;
grunt> DUMP top100;
grunt> top100 = LIMIT filter_by_yr 100;
grunt> DUMP top100;
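Note that LIMIT on its own returns an arbitrary 100 records, and the selection can change from one run to the next. If you want the 100 records with the highest volume, one option is to sort before limiting. A minimal sketch, assuming the filter_by_volume relation from above; sorted_by_volume and top100_by_volume are illustrative names:

```pig
-- Sort the filtered records by volume, highest first.
sorted_by_volume = ORDER filter_by_volume BY volume DESC;

-- Keep the first 100 records of the sorted relation and print them.
top100_by_volume = LIMIT sorted_by_volume 100;
DUMP top100_by_volume;
```
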
See It In Action
Previous Lesson : Project and Manipulate Columns
Next Lesson : Grouping Records