December 31, 2015
Apache Pig Tutorial – Tuple & Bag
The goal of this tutorial is to learn Apache Pig concepts at a fast pace, so don’t expect lengthy posts. All posts will be short and sweet. Most posts will have a (very short) “see it in action” video.
So far we have been using simple datatypes in Pig such as chararray, float and int. In this post we will look at two complex types in Pig: Tuple and Bag.
To demonstrate, the statements below load the stock records, keep only those from the year 2003, and group them by symbol. We then describe grp_by_sym to look at its structure.
grunt> stocks = LOAD '/user/hirw/input/stocks' USING PigStorage(',') as (exchange:chararray, symbol:chararray, date:datetime, open:float, high:float, low:float, close:float, volume:int, adj_close:float);
grunt> filter_by_yr = FILTER stocks by GetYear(date) == 2003;
grunt> grp_by_sym = GROUP filter_by_yr BY symbol;
grunt> DESCRIBE grp_by_sym;
Dissect grp_by_sym
grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
From the output, we can see that the datatype of group is chararray. What is the datatype of filter_by_yr?
The structure of filter_by_yr starts with a curly brace { followed by a parenthesis (. In DESCRIBE output, curly braces denote a Bag and parentheses denote a Tuple.
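To see both notations side by side, here is a minimal sketch that builds one tuple and one bag from the fields of filter_by_yr using Pig's built-in TOTUPLE and TOBAG functions. The aliases shapes, open_close and high_low are made up for this illustration and are not part of the original example.

```pig
-- Assumes filter_by_yr has already been defined as in the statements above.
-- TOTUPLE packs the given fields into a single tuple.
-- TOBAG wraps each given value in its own tuple and collects them into a bag.
shapes = FOREACH filter_by_yr GENERATE
    symbol,
    TOTUPLE(open, close) AS open_close,
    TOBAG(high, low) AS high_low;

-- Print the schema to compare how the two complex types are displayed.
DESCRIBE shapes;
```

In the DESCRIBE output, open_close should be shown inside parentheses (a tuple) and high_low inside curly braces (a bag).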
Tuple vs. Bag
A Tuple is simply a record: an ordered set of columns, in our case exchange, symbol, date and so on. A Bag is a collection of records, that is, of Tuples. So if you look at the structure of filter_by_yr below, you can see that filter_by_yr is a bag; in other words, it is a collection of records with the columns exchange, symbol, date, open and so on.
grunt> DESCRIBE grp_by_sym;
grp_by_sym: {group: chararray,filter_by_yr: {(exchange: chararray,symbol: chararray,date: datetime,open: float,high: float,low: float,close: float,volume: int,adj_close: float)}}
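Because filter_by_yr is a bag of tuples, its columns can be reached with the dot operator, and the bag can be unnested back into individual records with FLATTEN. The following is only a sketch of that idea; the aliases closes_by_sym and unnested are hypothetical names chosen for this illustration.

```pig
-- Assumes grp_by_sym has already been defined as in the statements above.
-- Projecting one column of the inner bag yields a smaller bag
-- whose tuples contain only the close field.
closes_by_sym = FOREACH grp_by_sym GENERATE group, filter_by_yr.close;
DESCRIBE closes_by_sym;

-- FLATTEN unnests the bag, producing one output record
-- for every tuple that the bag contains.
unnested = FOREACH grp_by_sym GENERATE group, FLATTEN(filter_by_yr.(date, close));
DESCRIBE unnested;
```

Comparing the two DESCRIBE outputs is a quick way to check whether a field is still a bag (curly braces) or has been turned back into plain columns.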
You can also say that grp_by_sym itself is a bag, because its structure starts with a curly brace. Although the parentheses are omitted from the display in this case, grp_by_sym represents a bag, that is, a collection of records or Tuples. We can define grp_by_sym as a collection of Tuples with 2 columns in each Tuple: group, which is of type chararray, and filter_by_yr, which is of type bag.
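Since each Tuple of grp_by_sym carries a whole bag of records for one symbol, a common next step is to reduce that bag to a few values per symbol. The sketch below uses Pig's built-in COUNT, AVG and MAX functions; the alias sym_stats and the output column names are assumptions made for this example.

```pig
-- Assumes grp_by_sym has already been defined as in the statements above.
-- group is the grouping key (the symbol).
-- COUNT takes the whole bag; AVG and MAX take one projected column of the bag.
sym_stats = FOREACH grp_by_sym GENERATE
    group AS symbol,
    COUNT(filter_by_yr) AS num_records,
    AVG(filter_by_yr.close) AS avg_close,
    MAX(filter_by_yr.high) AS max_high;

-- The result has one tuple per symbol and only simple types.
DESCRIBE sym_stats;
```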
See It In Action
Previous Lesson : Execute Script with Parameters
Next Lesson : Map