Apache Pig Tutorial - Filter Records - Big Data In Real World

The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.

In our previous posts we saw different variations of loading a dataset in Pig and also saw how to project and manipulate columns. In this post we will see how to filter records with Apache Pig.

Use the FILTER operator to keep only the records of a dataset that satisfy a condition. It is very similar to the WHERE clause in SQL.

Load the stocks dataset.

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

FILTER Operator

Filter stock records with volume over 400,000. The resulting filter_by_volume relation contains only the stock records whose volume is greater than 400,000.

grunt> filter_by_volume = FILTER stocks by volume > 400000;
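
FILTER conditions can also be combined with the boolean operators AND, OR and NOT. The lines below are a minimal sketch, assuming the stocks relation loaded earlier in this post. The relation names and the 'NYSE' exchange value are made up for illustration, so substitute values that exist in your data.

```pig
-- Assumes the stocks relation loaded earlier in this post.
-- 'NYSE' is a hypothetical exchange value used only for illustration.

-- AND keeps records that satisfy both conditions.
high_volume_nyse = FILTER stocks BY (volume > 400000) AND (exchange == 'NYSE');

-- OR keeps records that satisfy at least one condition.
low_or_high_volume = FILTER stocks BY (volume < 1000) OR (volume > 400000);

-- NOT inverts a condition.
not_high_volume = FILTER stocks BY NOT (volume > 400000);
```

The parentheses are optional here, but they make the grouping explicit once a condition mixes AND and OR.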

Filter stock records from the year 2003. We use the GetYear function to extract the year from the date column. The resulting filter_by_yr relation contains only the stock records from 2003.

grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
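
The same idea extends to other date parts. As a rough sketch, assuming the date column was loaded as datetime as in the LOAD statement above, the built-in GetMonth function can narrow the filter to a single month, and two GetYear comparisons can select a range of years. The relation names and the chosen month and years are illustrative only.

```pig
-- Assumes the stocks relation loaded earlier, with date typed as datetime.

-- Records from January 2003 only.
filter_by_jan_2003 = FILTER stocks BY GetYear(date) == 2003 AND GetMonth(date) == 1;

-- Records from 2003 through 2005, inclusive.
filter_by_yr_range = FILTER stocks BY GetYear(date) >= 2003 AND GetYear(date) <= 2005;
```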

Display Results

Display both filter_by_volume and filter_by_yr. Limit the number of records to 100.

grunt> top100 = LIMIT filter_by_volume 100;
grunt> DUMP top100;
grunt> top100 = LIMIT filter_by_yr 100;
grunt> DUMP top100;
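
The commands above reuse the top100 alias, so the second LIMIT replaces the first. A possible variation, sketched here with hypothetical alias names, gives each result its own alias and uses DESCRIBE to check the schema before dumping. Note that LIMIT without a preceding ORDER does not guarantee which 100 records are returned.

```pig
-- Assumes filter_by_volume and filter_by_yr were defined as shown earlier.

-- Print the schema of the filtered relation to confirm the column types.
DESCRIBE filter_by_volume;

-- Separate aliases keep both limited relations available.
first100_by_volume = LIMIT filter_by_volume 100;
first100_by_yr = LIMIT filter_by_yr 100;

DUMP first100_by_volume;
DUMP first100_by_yr;
```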

Previous Lesson : Project and Manipulate Columns

Next Lesson : Grouping Records
