Apache Pig Tutorial - Project and Manipulate Columns - Big Data In Real World

Apache Pig Tutorial – Project and Manipulate Columns

Apache Pig Tutorial -Load Variations
December 14, 2015
Apache Pig Tutorial – Filter Records
December 18, 2015
Apache Pig Tutorial -Load Variations
December 14, 2015
Apache Pig Tutorial – Filter Records
December 18, 2015

Apache Pig Tutorial – Project and Manipulate Columns

Goal of this tutorial is to learn Apache Pig concepts in a fast pace. So don’t except lengthy posts. All posts will be short and sweet. Most posts will have (very short) “see it in action” video.

In this post, we saw different variations of loading a dataset in Pig. In this post we will see how to project and manipulate columns.

Load Dataset With Column Names and Types

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

Use FOREACH..GENERATE operator for projecting the columns in the dataset.

grunt> projection = FOREACH stocks GENERATE symbol, SUBSTRING(exchange, 0, 1) as sub_exch, close - open as up_or_down;

Now projection  will have 3 columns – symbol, sub_exch which is a result of a substring operation on the exchange column and up_or_down which is a result of close price minus the opening price.

grunt> top100 = LIMIT projection 100;
grunt> DUMP top100;

 Load Dataset With Out Column Names and Types

grunt> stocks1 = LOAD '/user/hirw/input/stocks' USING PigStorage(',');
grunt> DESCRIBE stocks1;
Schema for stocks1 unknown.

With schema unknown for stocks1 relation how can we project and manipulate column? Very simple, we can use the column position starting with index 0 for first column.

grunt> projection = FOREACH stocks1 GENERATE $1 as symbol, SUBSTRING($0, 0, 1) as sub_exch, $6 - $3 as up_or_down;

How does Pig figure out the datatypes? Pig is smart, it looks at the operation you are tying to perform on the columns and cast the type from the default bytearrary to appropriate type. For example since we are trying to do a substring operation on the exchange column ($0), Pig assumes exchange column in string and will cast bytearray to chararray.

Similarly close ($6) and open ($3) are converted to double (casting to integer or float might lose precision) since these columns are involved in numerical calculation.

grunt> DESCRIBE projection;
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 2 time(s).
projection: {symbol: bytearray,sub_exch: chararray,up_or_down: double}

Finally, display the results.

grunt> top100 = LIMIT projection 100;
grunt> DUMP top100;

 See It In Action

Previous Lesson : Load Variations

Big Data In Real World
Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.

1 Comment

  1. […] Previous Lesson : Project and Manipulate Columns […]

Apache Pig Tutorial – Project and Manipulate Columns
This website uses cookies to improve your experience. By using this website you agree to our Data Protection Policy.

Hadoop In Real World is now Big Data In Real World!

X