Apache Pig Tutorial – Load Variations
December 14, 2015
Apache Pig Tutorial – Filter Records
December 18, 2015
Apache Pig Tutorial – Project and Manipulate Columns
The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.
In the previous post, we saw different variations of loading a dataset in Pig. In this post, we will see how to project and manipulate columns.
Load Dataset With Column Names and Types
grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Use the FOREACH ... GENERATE operator to project columns from the dataset.
grunt> projection = FOREACH stocks GENERATE symbol, SUBSTRING(exchange, 0, 1) as sub_exch, close - open as up_or_down;
Now projection has 3 columns: symbol; sub_exch, which is the result of a substring operation on the exchange column; and up_or_down, which is the closing price minus the opening price.
grunt> top100 = LIMIT projection 100;
grunt> DUMP top100;
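FOREACH ... GENERATE accepts other arithmetic expressions in the same way. The lines below are a minimal sketch, assuming the stocks relation loaded above with its named and typed columns; the relation name daily_moves and the column names day_range and pct_change are made up for illustration and are not part of the original example.

```pig
-- Hypothetical extra projection over the typed stocks relation loaded above.
-- day_range:  difference between the day's high and low prices
-- pct_change: open-to-close move as a percentage of the opening price
grunt> daily_moves = FOREACH stocks GENERATE symbol, high - low AS day_range, (close - open) / open * 100 AS pct_change;
-- Inspect the schema Pig derived for the new columns.
grunt> DESCRIBE daily_moves;
```

Because stocks was loaded with a schema, the columns can be referenced by name and no casts are needed.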
Load Dataset Without Column Names and Types
grunt> stocks1 = LOAD '/user/hirw/input/stocks' USING PigStorage(',');
grunt> DESCRIBE stocks1;
Schema for stocks1 unknown.
With the schema unknown for the stocks1 relation, how can we project and manipulate columns? Very simple: we can refer to each column by its position, starting with index 0 for the first column.
grunt> projection = FOREACH stocks1 GENERATE $1 as symbol, SUBSTRING($0, 0, 1) as sub_exch, $6 - $3 as up_or_down;
How does Pig figure out the data types? Pig looks at the operation you are trying to perform on each column and casts it from the default bytearray to the appropriate type. For example, since we are performing a substring operation on the exchange column ($0), Pig assumes the exchange column is a string and casts it from bytearray to chararray.
Similarly, close ($6) and open ($3) are cast to double (casting to int or float might lose precision) because these columns are used in a numerical calculation.
grunt> DESCRIBE projection;
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).
2015-12-09 12:41:55,172 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_DOUBLE 2 time(s).
projection: {symbol: bytearray,sub_exch: chararray,up_or_down: double}
Finally, display the results.
grunt> top100 = LIMIT projection 100;
grunt> DUMP top100;
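In the DESCRIBE output above, symbol is still a bytearray, because no operation on $1 forced Pig to cast it. If you would rather not rely on implicit casts, one option is to write the casts yourself. This is a sketch, assuming the same stocks1 relation and column positions; the relation name typed_projection is hypothetical.

```pig
-- Explicit casts on positional columns: each type is stated up front
-- instead of being inferred from the operation applied to the column.
grunt> typed_projection = FOREACH stocks1 GENERATE (chararray)$1 AS symbol, SUBSTRING((chararray)$0, 0, 1) AS sub_exch, (double)$6 - (double)$3 AS up_or_down;
-- Check the resulting schema.
grunt> DESCRIBE typed_projection;
```

Stating the types explicitly makes the intent clear to anyone reading the script, and symbol is now declared as chararray rather than left as a bytearray.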