Pig vs. Hive
Apache Pig takes a set of instructions written in Pig Latin, compiles them into a set of MapReduce jobs, and executes those jobs on a Hadoop cluster.
Apache Hive takes a "SQL-like" query as input, compiles it into a set of MapReduce jobs, and executes those jobs on a Hadoop cluster.
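To make the difference between "a set of instructions" and a "SQL-like" query concrete, here is a small sketch in plain Python. It does not use Pig, Hive, or Hadoop: the first function spells out the steps one at a time, the way an instruction-style script does, and the second states the desired result as a single SQL statement, with Python's built-in sqlite3 module standing in for a SQL-like engine. The `CLICKS` data and both function names are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (user_id, page) click records.
CLICKS = [
    ("alice", "home"),
    ("bob", "home"),
    ("alice", "search"),
    ("carol", "home"),
    ("bob", "cart"),
]


def count_by_page_stepwise(records):
    """Instruction style: group, count, and order as separate steps."""
    grouped = defaultdict(list)
    for user_id, page in records:          # step 1: group records by page
        grouped[page].append(user_id)
    counts = {page: len(users) for page, users in grouped.items()}  # step 2: count each group
    return dict(sorted(counts.items()))    # step 3: order the output by page


def count_by_page_sql(records):
    """Query style: describe the result in one SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE clicks (user_id TEXT, page TEXT)")
    conn.executemany("INSERT INTO clicks VALUES (?, ?)", records)
    rows = conn.execute(
        "SELECT page, COUNT(*) FROM clicks GROUP BY page ORDER BY page"
    ).fetchall()
    conn.close()
    return dict(rows)


if __name__ == "__main__":
    print(count_by_page_stepwise(CLICKS))
    print(count_by_page_sql(CLICKS))
```

Both functions return the same counts; only the way the work is expressed differs, which is roughly the relationship between the two tools' inputs.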
Both Apache Pig and Hive are widely used in Hadoop environments and are must-know tools for any aspiring Hadoop developer. The two descriptions above sound very similar, which raises the following questions:
- Why do we have two tools performing somewhat similar operations in the Hadoop ecosystem?
- Do Pig and Hive coexist in Hadoop production environments?
- Which tool is better – Pig or Hive?
Why do we have two tools performing somewhat similar operations in the Hadoop ecosystem?
Pig and Hive were developed by Yahoo and Facebook respectively, around the same time, to solve the same problem: making Hadoop easily accessible to non-programmers. In the early stages of development, neither company had a clear view of what the other's tool could do, which resulted in the overlap.
Do Pig and Hive coexist in Hadoop production environments?
The answer is yes. We have seen successful Hadoop implementations using both Pig and Hive in the same environment.
Here is one such use case: Pig handles the standard nightly Extract, Transform and Load (ETL) jobs that perform predefined aggregation, data cleanup, filtering and structuring, while developers, data analysts and data scientists use Hive day to day for ad hoc analysis of the data.
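The division of labor in that use case can be sketched in plain Python. This is only an analogy, not Pig or Hive code: `nightly_etl` plays the role of the predefined batch job that cleans and structures raw records, and `ad_hoc` plays the role of an analyst running whatever query they need against the cleaned table, again with sqlite3 standing in for a SQL-like engine. The log format, sample lines, table name and function names are all assumptions made for the example.

```python
import sqlite3

# Hypothetical raw log lines in the form "day,user,amount"; some are malformed.
RAW_LINES = [
    "2015-10-01,alice,20.50",
    "2015-10-01,bob,not_a_number",   # unparsable amount, dropped during cleanup
    "2015-10-01, Carol ,5.00",       # stray whitespace and inconsistent casing
    "2015-10-02,alice,10.00",
    "",                              # empty line, filtered out
    "2015-10-02,bob,7.25",
]


def nightly_etl(lines):
    """Predefined batch job: clean, filter, and structure the raw lines."""
    rows = []
    for line in lines:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != 3:
            continue                  # filter out malformed records
        day, user, amount = parts
        try:
            value = float(amount)
        except ValueError:
            continue                  # drop rows whose amount is not a number
        rows.append((day, user.lower(), value))
    return rows


def load_table(rows):
    """Load the cleaned rows into an in-memory table for querying."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (day TEXT, user_name TEXT, amount REAL)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    return conn


def ad_hoc(conn, sql):
    """Run an arbitrary analyst query against the cleaned table."""
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    conn = load_table(nightly_etl(RAW_LINES))
    print(ad_hoc(conn, "SELECT day, SUM(amount) FROM purchases GROUP BY day ORDER BY day"))
    print(ad_hoc(conn, "SELECT user_name, COUNT(*) FROM purchases GROUP BY user_name ORDER BY user_name"))
```

The batch step is written once and run on a schedule, while the queries passed to `ad_hoc` can change from one day to the next without touching the pipeline.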
Which tool is better – Pig or Hive?
There is no straightforward answer. Both tools are equally important and have strong user bases and communities. Both are highly configurable and integrate easily with custom Java code.
Pig Latin is an easy-to-learn instructional language. There is no need to think of it as a whole new programming language to master, because its instructions are easy to follow.
Hive has a shorter learning curve, because anyone who is familiar with SQL will feel right at home with the tool. Hive also lets developers view the data as a table of rows and columns, which is a great plus.