We hosted a webinar on Saturday, September 30th, 2017, on the topic of RCFile vs. ORC. More than 60 participants joined, so first of all, we would like to thank everyone who attended. We always enjoy hosting webinars and going live, because it gives us a great way to interact with the Hadoop In Real World community and students in real time. Among the participants who shared where they are from, we had people from India, the US, Canada and the UK. The webinar was very engaging and interactive.
Row vs. Columnar
The topic of discussion was RCFile vs. ORC, so we started by explaining what row-major and column-major formats are. We went over the differences between the two, followed by the advantages and disadvantages of each.
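To make the row-major vs. column-major distinction concrete, here is a minimal Python sketch. The sample table and the helper names (`to_row_major`, `to_column_major`) are our own illustration and were not part of the webinar. It flattens the same table both ways and then reads a single column from each layout.

```python
# Hypothetical sample table with columns (id, name, amount).
rows = [
    (1, "alice", 10.0),
    (2, "bob", 20.0),
    (3, "carol", 30.0),
]

def to_row_major(rows):
    """Lay values out record by record: all fields of row 1, then row 2, ..."""
    return [value for row in rows for value in row]

def to_column_major(rows):
    """Lay values out column by column: all ids, then all names, ..."""
    return [value for column in zip(*rows) for value in column]

row_layout = to_row_major(rows)
column_layout = to_column_major(rows)

n_rows, n_cols = len(rows), len(rows[0])

# Reading only the "amount" column (column index 2):
# row-major needs a strided read that touches every record,
amounts_from_rows = row_layout[2::n_cols]
# column-major needs one contiguous slice.
amounts_from_columns = column_layout[2 * n_rows:3 * n_rows]
```

Both reads return the same values. The difference is the access pattern: the row-major read has to step across every record, while the column-major read is one contiguous range, which is why columnar layouts tend to suit queries that touch only a few columns.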
RCFile – Motivation & Lazy Decompression
RCFile is a joint effort from Facebook, Ohio State University, and the Institute of Computing Technology at the Chinese Academy of Sciences. We explained the motivation behind RCFile, how it combines the advantages of both record and columnar formats, and what benefits it brings to the big data space. A discussion of RCFile is not complete without explaining lazy decompression, a technique RCFile uses to avoid unnecessary work during query execution. We explained what lazy decompression is and how it is used in RCFile.
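The idea behind lazy decompression can be sketched in a few lines of Python. This is a simplified illustration and not RCFile's actual implementation: the `RowGroup` class, the `select_where` function and the use of `zlib` are all assumptions made for the example. Each row group compresses its columns separately, and a column is decompressed only when the query actually needs it.

```python
import zlib

class RowGroup:
    """Hypothetical row group: each column is compressed on its own."""

    def __init__(self, columns):
        # columns maps a column name to a non-empty list of ints.
        self._compressed = {
            name: zlib.compress(",".join(map(str, values)).encode())
            for name, values in columns.items()
        }
        self.decompressed = []  # records which columns were decompressed

    def read(self, name):
        """Decompress one column on demand."""
        self.decompressed.append(name)
        raw = zlib.decompress(self._compressed[name]).decode()
        return [int(v) for v in raw.split(",")]

def select_where(groups, select_col, filter_col, predicate):
    """SELECT select_col WHERE predicate(filter_col), decompressing lazily."""
    result = []
    for group in groups:
        filter_values = group.read(filter_col)
        matches = [i for i, v in enumerate(filter_values) if predicate(v)]
        if not matches:
            continue  # select_col is never decompressed for this row group
        select_values = group.read(select_col)
        result.extend(select_values[i] for i in matches)
    return result

groups = [
    RowGroup({"a": [1, 2, 3], "b": [10, 20, 30]}),
    RowGroup({"a": [7, 8, 9], "b": [70, 80, 90]}),
]
result = select_where(groups, "b", "a", lambda v: v > 6)
```

In this sketch the first row group has no rows matching the filter on `a`, so its `b` column stays compressed. Only the second row group pays the cost of decompressing both columns.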
ORC – Inefficiencies with RCFile, structure & indexes
Next, we moved on to ORC (Optimized Row Columnar). ORC is an open source file format that originated at Hortonworks. We started by discussing the inefficiencies of RCFile and the need to optimize it. We briefly looked at the structure of an ORC file. One of the strong selling points of ORC is the statistics, or metadata, it keeps about the columns. ORC stores these statistics, or indexes, at three levels: file level, stripe level and row level. We took a very simple query with a WHERE condition and illustrated how ORC can skip files, stripes or row groups based on that condition and the statistics stored in the indexes. Finally, we finished the discussion by touching on the efficient compression and encoding in ORC.
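The skipping behavior can be illustrated with a small Python sketch. It is a toy model built on nested lists, not the ORC reader: the `Node` class, the `build` and `scan_equals` functions, and the sample data are hypothetical. Min/max statistics are kept at the file, stripe and row group levels, and a query of the form `WHERE x = target` skips any level whose range cannot contain the target.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Min/max statistics for a file, stripe or row group."""
    minimum: int
    maximum: int
    children: list  # child Nodes, or raw int values at the row group level

def build(level):
    """Build statistics bottom-up. A list of ints is a row group;
    a list of lists is a stripe or a file."""
    if isinstance(level[0], int):
        return Node(min(level), max(level), list(level))
    kids = [build(child) for child in level]
    return Node(min(k.minimum for k in kids), max(k.maximum for k in kids), kids)

def scan_equals(node, target, read_log):
    """Return values equal to target, skipping anything the stats rule out."""
    if not (node.minimum <= target <= node.maximum):
        return []  # skip this file, stripe or row group entirely
    if isinstance(node.children[0], int):
        read_log.append(node.children)  # this row group is actually read
        return [v for v in node.children if v == target]
    hits = []
    for child in node.children:
        hits.extend(scan_equals(child, target, read_log))
    return hits

# Two hypothetical files; each file is a list of stripes,
# and each stripe is a list of row groups.
files = [
    [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]],
    [[[100, 101], [102, 103]]],
]

read_log = []
hits = []
for file_data in files:
    hits.extend(scan_equals(build(file_data), 8, read_log))
```

For the target value 8, the second file is skipped at the file level, the first stripe of the first file is skipped at the stripe level, and only the row group `[7, 8, 9]` is read. A real ORC file writes these statistics when the file is created instead of computing them at query time as this sketch does.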
Q&A
The participants were very engaged. Here are some of the questions from our Q&A:
- Is it a must to use these file formats in production?
- What is Parquet?
- How does ORC compare with Parquet?
- Given a choice which file format should I choose and why?
- Why is RCFile’s row group size so small (4 MB)?
- Can these file formats be used with Hive?
- How do you append to these file formats?
- Questions about upcoming releases and courses from hadoopinrealworld.com.
We host webinars like this quite often. Sign up below to get invitations to join one of our webinars.
Here is the full recording of the webinar. Enjoy!