February 6, 2017
Finding the MAX tuple with Pig
Here is a sample dataset. Our goal is to find the record with the maximum record_value, which is DEF,300.
record_key, record_value
ABC,100
DEF,300
GHI,40
XYZ,150
Script
Here is a short, simple script to find the record with the maximum value.
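dataset = LOAD 'max-records-test.txt' USING PigStorage(',') AS (record_key: chararray, record_value: long);
-- Collect every record into a single group
A = GROUP dataset ALL;
-- Compute the maximum record_value across the whole bag
B = FOREACH A GENERATE MAX(dataset.record_value) AS max_val;
-- Keep only the records whose record_value equals that maximum
C = FILTER dataset BY record_value == (long)B.max_val;
DUMP C;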
Explanation
GROUP dataset ALL groups all records in the dataset into a single tuple (one row). Because every record goes into the same group, the result is just one tuple with two columns. The first column is the group key, which is the literal all. The second column is the interesting one: it is a nested column holding a bag of all the original tuples. The output of the group operation looks like this:
(all, {(ABC,100),(DEF,300),(GHI,40),(XYZ,150)})
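To check the shape of the grouped relation without printing its contents, DESCRIBE can be run on A. This is a minimal sketch that assumes the same input file and alias names as the script above; the exact formatting of the schema output differs between Pig versions, so no output is shown here.

```pig
dataset = LOAD 'max-records-test.txt' USING PigStorage(',') AS (record_key: chararray, record_value: long);
A = GROUP dataset ALL;
-- Print the schema of A: a 'group' field plus a bag named after the grouped relation (dataset)
DESCRIBE A;
```

The bag takes the name of the relation that was grouped, which is why the next step refers to dataset.record_value inside the FOREACH.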
FOREACH A GENERATE MAX(dataset.record_value) returns the maximum record_value from the bag of tuples {(ABC,100),(DEF,300),(GHI,40),(XYZ,150)}.
B will therefore contain a single tuple, (300).
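The next instruction filters the dataset down to the record whose record_value is 300. Because B contains exactly one row, B.max_val can be used as a scalar in the comparison. Finally, DUMP prints the record with the maximum record_value.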
C = FILTER dataset BY record_value == (long)B.max_val;
DUMP C;
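For comparison, the same record can usually be reached by sorting and taking the first row. This is a different technique from the GROUP ALL script above, not a restatement of it, and the aliases D and E are chosen here only for illustration. A sketch, assuming the same input file:

```pig
dataset = LOAD 'max-records-test.txt' USING PigStorage(',') AS (record_key: chararray, record_value: long);
-- Sort the records by record_value, largest first
D = ORDER dataset BY record_value DESC;
-- Keep only the first row of the sorted relation
E = LIMIT D 1;
-- For the sample data this prints (DEF,300)
DUMP E;
```

The two approaches differ when several records share the maximum value: the FILTER version returns every tied record, while ORDER followed by LIMIT 1 returns only one of them.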