Apache Pig Tutorial - Loading Datasets - Big Data In Real World

Apache Pig Tutorial – Loading Datasets



The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet, and most will include a (very short) “see it in action” video.

What is Apache Pig & why is it awesome?

Apache Pig, originally developed by Yahoo!, is now a top-level Apache open source project. Pig takes a set of instructions from the user, written in Pig Latin, analyzes them, converts them into MapReduce programs and executes them on a Hadoop cluster. Why is this awesome? Because we don’t need to hand-write MapReduce code in a general-purpose programming language such as Java to run MapReduce jobs in Hadoop.

Loading datasets

Pig instructions can be executed interactively in the Grunt shell (look at the video).

Pig Latin is a collection of simple-to-use instructions. The user writes Pig Latin instructions; Pig analyzes them and converts them to MapReduce jobs. The LOAD operator is used to load a dataset in Pig.

The dataset in /user/hirw/input/stocks (an HDFS location) is a comma-delimited stocks dataset with day-by-day stock information for several stock symbols.

ABCSE,KBT,2001-05-22,11.75,11.75,10.80,11.10,765000,5.04
ABCSE,KBT,2001-05-21,11.75,11.87,11.51,11.75,814600,5.33
ABCSE,KBT,2001-05-18,12.10,12.10,11.83,11.85,250600,5.38
ABCSE,KIB,2010-02-08,3.03,3.15,2.94,3.10,455600,3.10
ABCSE,KIB,2010-02-05,3.01,3.06,2.96,3.04,635400,3.04
ABCSE,KIB,2010-02-04,3.03,3.50,2.85,3.05,990400,3.05

PigStorage(',') tells Pig that we are loading a comma-delimited dataset. Note that in the LOAD statement the column names are listed in the same order as the columns are laid out in the dataset, each followed by its data type. Also, the data types resemble the data types in Java.

grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

Right after this instruction, stocks can be used to refer to the loaded dataset in the remaining Pig instructions of the current Grunt session.
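To check that the schema was attached as intended, one option is the DESCRIBE operator, which prints the schema of a relation. The following is a minimal sketch that reuses the LOAD statement and HDFS path from above; it assumes the stocks dataset exists at that path and is written as a script (without the grunt> prompt) so the comments can be included.

```pig
-- Load the comma-delimited stocks dataset with an explicit schema
-- (same statement as in the tutorial text).
stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') AS (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

-- Print the schema that Pig has associated with the alias "stocks".
-- DESCRIBE only reports the schema; it does not launch a MapReduce job.
DESCRIBE stocks;
```

Since DESCRIBE does not process the data, it is a quick way to spot a misspelled column name or an unintended type before running DUMP.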

Display results on screen

The DUMP operator is used to display results on screen.

grunt> DUMP stocks;

Limit results

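Don’t want to see all the results? Use the LIMIT operator. LIMIT returns at most the given number of records; unless the relation was sorted first, which records are returned is not guaranteed.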

grunt> top100 = LIMIT stocks 100;
grunt> DUMP top100;
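If the intent is an actual "top 100" rather than just any 100 records, one possible approach is to sort the relation with ORDER before applying LIMIT. The sketch below assumes the stocks relation loaded earlier in the same session; sorting by volume is chosen purely for illustration, and the alias names by_volume and top100_by_volume are hypothetical.

```pig
-- Sort all records by trading volume, highest first.
-- Assumes "stocks" was loaded with the schema shown earlier.
by_volume = ORDER stocks BY volume DESC;

-- Keep only the first 100 records of the sorted relation.
top100_by_volume = LIMIT by_volume 100;

-- Print the 100 highest-volume records on screen.
DUMP top100_by_volume;
```

Sorting the full dataset costs more than a plain LIMIT, so this form is only worth using when the order of the displayed records actually matters.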

See It In Action

 

Next Lesson: Load Variations

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
