December 14, 2015
Apache Pig Tutorial – Loading Datasets
The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.
What is Apache Pig & why is it awesome?
Apache Pig was originally developed at Yahoo! and is now a top-level Apache open source project. Pig takes a set of instructions from the user, written in Pig Latin, analyzes them, converts them into MapReduce programs, and executes them on a Hadoop cluster. Why is this awesome? Because we don’t need to know a general-purpose programming language such as Java to write and execute MapReduce jobs in Hadoop.
Loading datasets
Pig instructions can be executed interactively in the Grunt shell (look at the video).
Pig Latin is a collection of simple-to-use instructions. The user writes Pig Latin instructions; Pig analyzes them and converts them to MapReduce jobs. The LOAD operator is used to load a dataset into Pig.
The dataset in /user/hirw/input/stocks (an HDFS location) is a comma-delimited stocks dataset with day-by-day stock information for several stock symbols.
ABCSE,KBT,2001-05-22,11.75,11.75,10.80,11.10,765000,5.04
ABCSE,KBT,2001-05-21,11.75,11.87,11.51,11.75,814600,5.33
ABCSE,KBT,2001-05-18,12.10,12.10,11.83,11.85,250600,5.38
ABCSE,KIB,2010-02-08,3.03,3.15,2.94,3.10,455600,3.10
ABCSE,KIB,2010-02-05,3.01,3.06,2.96,3.04,635400,3.04
ABCSE,KIB,2010-02-04,3.03,3.50,2.85,3.05,990400,3.05
PigStorage(',') tells Pig that we are loading a comma-delimited dataset. Note that the column names are listed in the same order as they are laid out in the dataset, each along with its data type. Also, the data types resemble the data types in Java.
grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
Right after this instruction, stocks can be used to refer to the loaded dataset in the remaining Pig instructions of the current Grunt session.
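To check that Pig registered the column names and data types as intended, the DESCRIBE operator prints the schema of a relation. DESCRIBE is not covered in the original walkthrough; the line below is a minimal sketch that assumes the LOAD statement above has already been run in the same Grunt session.

```pig
-- Print the schema (column names and data types) attached to the stocks relation.
-- Assumes "stocks" was defined by the LOAD statement earlier in this session.
grunt> DESCRIBE stocks;
```

DESCRIBE only reports the schema, so it is a quick sanity check before running DUMP on a large dataset.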
Display results on screen
The DUMP operator is used to display the results on screen.
grunt> DUMP stocks;
Limit results
Don’t want to see all the results? Use the LIMIT operator.
grunt> top100 = LIMIT stocks 100;
grunt> DUMP top100;
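The same instructions do not have to be typed one at a time in the Grunt shell. As a rough sketch, they could be saved in a script file and run in one go. The file name load_stocks.pig is hypothetical and not from the original post; the statements themselves are the ones used above.

```pig
-- load_stocks.pig (hypothetical file name, chosen for illustration)

-- Load the comma-delimited stocks dataset from HDFS and attach a schema.
stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',')
         AS (exchange:chararray, symbol:chararray, date:datetime,
             open:float, high:float, low:float, close:float,
             volume:int, adj_close:float);

-- Keep only 100 records, then display them on screen.
top100 = LIMIT stocks 100;
DUMP top100;
```

Assuming the file is saved under that name, it could be run from the command line with pig load_stocks.pig instead of entering each line at the grunt> prompt.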
See It In Action
Next Lesson: Load Variations