Apache Pig Tutorial – Filter Records

The goal of this tutorial is to teach Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet, and most will include a (very short) “see it in action” video.

In our previous posts we saw different variations of loading a dataset in Pig and also saw how to project and manipulate columns. In this post we will see how to filter records with Apache Pig.

Use the FILTER operator to keep only the records in a dataset that satisfy a condition. It is very similar to the WHERE clause in SQL.

Load the stocks dataset.

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
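
Before filtering, it can help to confirm that the relation has the schema you expect. A quick check, using DESCRIBE to print the field names and types declared in the AS clause above:

```pig
-- Print the schema of the stocks relation (field names and types from the AS clause).
grunt> DESCRIBE stocks;
```

The filters in this post compare volume as a number and pass date to a date function, so volume should be listed as int and date as datetime.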

FILTER Operator

Filter stock records with volume over 400,000. The resulting filter_by_volume relation contains only the stock records whose volume is greater than 400,000.

grunt> filter_by_volume = FILTER stocks by volume > 400000;
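
Conditions can also be combined with AND, OR and NOT, much like in a SQL WHERE clause. The lines below are a small sketch against the same stocks relation; the alias names (filter_up_days, filter_either, filter_low_volume) and the conditions themselves are only illustrative and not part of the original lesson.

```pig
-- Both conditions must hold: volume over 400,000 AND the close is above the open.
grunt> filter_up_days = FILTER stocks BY volume > 400000 AND close > open;

-- Either condition may hold; the parentheses make the grouping explicit.
grunt> filter_either = FILTER stocks BY (volume > 400000) OR (close < open);

-- NOT inverts a condition: records with volume of 400,000 or less.
grunt> filter_low_volume = FILTER stocks BY NOT (volume > 400000);
```

Note that a record is kept only when the condition evaluates to true, so records where volume is null are dropped by all three filters that test it.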

Filter stock records from the year 2003. We use the GetYear function to extract the year from the date column. The resulting filter_by_yr relation contains only the stock records from 2003.

grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
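
The same pattern extends to other date-based filters. As a sketch, assuming the date column loaded as datetime as declared above, the built-in GetYear and GetMonth functions can be combined to select a range of years or a single month. The alias names filter_by_yrs and filter_by_jan_2003 are made up for this example.

```pig
-- Records from 2003 through 2005, inclusive.
grunt> filter_by_yrs = FILTER stocks BY GetYear(date) >= 2003 AND GetYear(date) <= 2005;

-- Records from January 2003 only; GetMonth returns the month as a number from 1 to 12.
grunt> filter_by_jan_2003 = FILTER stocks BY GetYear(date) == 2003 AND GetMonth(date) == 1;
```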

Display Results

Display both filter_by_volume and filter_by_yr, limiting the number of records to 100 for each.

grunt> top100 = LIMIT filter_by_volume 100;
grunt> DUMP top100;

grunt> top100 = LIMIT filter_by_yr 100;
grunt> DUMP top100;
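
LIMIT on its own does not guarantee which 100 records come back. If you actually want the 100 records with the highest volume, one option is to sort first and then limit. This is a sketch only; the alias names sorted_by_volume and top100_by_volume are introduced here for illustration.

```pig
-- Sort the filtered records so the highest volume comes first.
grunt> sorted_by_volume = ORDER filter_by_volume BY volume DESC;

-- LIMIT applied to an ordered relation returns the first 100 records in that order.
grunt> top100_by_volume = LIMIT sorted_by_volume 100;
grunt> DUMP top100_by_volume;
```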

Previous Lesson : Project and Manipulate Columns

Next Lesson : Grouping Records

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
