Working with HDFS
February 9, 2017

In the Understanding the Big Data Problem post we saw that HDFS, the Hadoop Distributed File System, takes care of all the storage-related complexities in Hadoop. In this post, let's understand more about HDFS.
You can get access to our free Hadoop cluster to try the commands in this post. Before you read on, please note that this post is from our free course named Hadoop Starter Kit, a 100% free introductory course on Hadoop. Click here to enroll in Hadoop Starter Kit. You will also get free access to our 3-node Hadoop cluster hosted on Amazon Web Services (AWS), also free!
Let's start this post by asking a question. Have you used a filesystem before?

The answer has to be yes. The very fact that you are using a laptop or a portable device to read this post means you are using a filesystem behind the scenes. A filesystem is an integral part of every operating system; it basically governs the storage on your hard disk.
With and without a filesystem
Take a look at the picture below. Let's say I give one person a book, give another person a pile of unordered pages from the same book, and ask each of them to find Chapter 34. Who do you think will get there faster? The one with the book, right? He can simply go to the index, look up the page number for Chapter 34, and turn to that page, whereas the one with the pile of papers has to dig through it page by page and, if he is lucky, might eventually find Chapter 34.

Just like a well-organized book, a filesystem helps you navigate the data stored on your disk. Without a filesystem, the information on your hard disk would be one large body of data with no way to tell where one piece of information stops and the next begins.
Here are some of the major functions of a filesystem.
- A filesystem controls how data is stored and retrieved. When you read and write files on your hard disk, the request goes through the filesystem.
- A filesystem holds metadata about your files and folders: the file name, size, owner, created and modified times, and so on (see the stat example after this list).
- A filesystem also takes care of permissions and security.
- A filesystem manages your storage space. When you ask to write a file to the hard disk, the filesystem figures out where on the disk the file should go, and it should do so as efficiently as possible.
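To see the kind of metadata a filesystem tracks, you can run stat on any Linux machine. The file name below is just a placeholder, and the exact output varies slightly between distributions:

    # Show the metadata the filesystem keeps for a file:
    # size, owner, permissions, and timestamps
    $ stat notes.txt
      File: notes.txt
      Size: 1248        Blocks: 8          IO Block: 4096   regular file
    Access: (0644/-rw-r--r--)  Uid: ( 1000/  hduser)   Gid: ( 1000/  hduser)
    Modify: 2017-02-09 10:11:58.000000000 +0000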
At the beginning of the post we mentioned that the filesystem is an integral part of your operating system, so let's look at some popular filesystems.

Although very old, the most legendary filesystem from Microsoft is FAT32. The maximum file size FAT32 can support is 4 GB, so if you have a file that is 5 GB in size, you are out of luck with FAT32. It also has a 32 GB volume (logical drive) limit. The numbers listed here are baseline figures; the actual limits can be higher or lower depending on the filesystem configuration. If you used Windows 95, 98, or Millennium Edition, you probably used FAT32.

The next-generation filesystem from Windows after FAT32 is NTFS (New Technology File System), and it supports a 16 exabyte file and volume limit. 16 exabytes is enormous: a single exabyte is 1,024 petabytes. So NTFS can support huge volumes of data. Starting with Windows Server 2012, Microsoft introduced ReFS, the Resilient File System.
If you are using Windows, how can you find out which filesystem is in use?

It is actually very simple. Go to My Computer, select a drive, go to Properties, and you can see the filesystem in use.
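If you prefer the command line, the standard fsutil tool reports the same information. Run it from an elevated (administrator) command prompt and substitute your own drive letter:

    REM Query volume information for the C: drive;
    REM the "File System Name" line shows NTFS, FAT32, etc.
    fsutil fsinfo volumeinfo C: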
Let's look at Mac filesystems now. HFS, the Hierarchical File System, is a legacy filesystem from the Mac. Apple started using HFS+ from Mac OS 8.1 onwards; for example, if you used an iPod, you were using HFS+. HFS+ can also handle huge volumes, about 8 exabytes in size.

ext is the most popular family of filesystems on Linux. ext3 is the third-generation Linux filesystem, in use since 2001. Then came ext4, which can support individual file sizes up to 16 TB and volumes up to 1 exabyte.

XFS was created by Silicon Graphics, and it can support up to 8 exabytes in both file and volume size.
How can you look up your filesystem in Linux?

Simply log in to your Linux terminal and type the command df -T at the prompt. The Type column shows the filesystem. On the Ubuntu 14.04 machine shown below, the filesystem in use is ext4.
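If you are following along on our free cluster, the output will look roughly like this; the device names and sizes here are illustrative:

    $ df -T
    Filesystem     Type   1K-blocks    Used  Available Use% Mounted on
    /dev/xvda1     ext4     8115168 2560292    5119500  34% /
    tmpfs          tmpfs     509848       0     509848   0% /run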
So clearly, recent filesystems can handle individual files of up to 8 or even 16 exabytes; we already have filesystems that can store big datasets. Then what is the need for HDFS?

Let's recap what we learnt in the Understanding the Big Data Problem post. We saw that to support truly parallel computation we had to divide the dataset into blocks and store them on different nodes, and to recover from data loss we also replicated each block on more than one node.
Local filesystem vs. HDFS
Assume you have a 10-node cluster with ext4 as the filesystem on each node; we will refer to the ext4 on each node as the local filesystem. First, when we upload a file, we need the filesystem to divide the dataset into fixed-size blocks. Although every filesystem has a concept of blocks, blocks in HDFS are very different from blocks in a traditional filesystem.

Next, your filesystem should have a distributed view of the files or blocks in your cluster, which is not possible with a local filesystem like ext4.

What we mean is, the local ext4 filesystem on node 1 has no idea what is on node 2, and node 2 has no idea what is on node 1. Since the ext4 filesystems on node 1 and node 2 are local to each node, there is no way they can have a global or distributed view. That is why we call the ext4 on each individual node the local filesystem.

The next important thing is replication. Since the ext4 on node 1 has no idea about the storage on any other node, it cannot replicate the blocks on node 1 to other nodes, which means we are exposed to data loss. Now assume you have a filesystem on top of ext4, but this time it is spread across all the nodes, and tadaaaa!! We call that the Hadoop Distributed File System.
So now when you upload a file to HDFS, it is automatically split into fixed-size blocks: 128 MB in recent versions of Hadoop, 64 MB in legacy versions. HDFS takes care of placing the blocks on different nodes and also replicates each block on more than one node. By default, HDFS replicates a block to 3 nodes.
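Both of those numbers are ordinary HDFS settings, not hard-coded behavior. Here is a minimal hdfs-site.xml sketch showing the two standard properties with their usual defaults; you would only add these if you wanted to change them on your own cluster:

    <!-- hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.blocksize</name>
        <value>134217728</value>  <!-- 128 MB block size, the recent default -->
      </property>
      <property>
        <name>dfs.replication</name>
        <value>3</value>          <!-- each block is stored on 3 nodes -->
      </property>
    </configuration>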
Let's say you copy a 700 MB dataset into HDFS. HDFS will divide the dataset into 128 MB blocks, so we will have 5 equal-sized 128 MB blocks (640 MB) plus one 60 MB block. Since HDFS has a distributed view of the cluster, it decides which nodes should hold these 6 blocks and also picks the nodes that hold the replicated copies. HDFS keeps track of all the blocks and their node assignments at all times, so when a user asks for the 700 MB dataset, it knows how to reconstruct the file from its blocks.
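You can watch this happen on a real cluster. Here is a sketch, assuming a local file named dataset.csv and a home directory of /user/hduser (both placeholders); the fsck output shown is illustrative, and its exact format varies across Hadoop versions:

    # Upload the 700 MB file; HDFS splits it into blocks automatically
    hdfs dfs -put dataset.csv /user/hduser/

    # Ask HDFS to list the file's blocks and where each replica lives
    hdfs fsck /user/hduser/dataset.csv -files -blocks -locations

    # Illustrative output: 6 blocks, each with 3 replicas
    #   /user/hduser/dataset.csv 734003200 bytes, 6 block(s):  OK
    #   0. blk_1073741845 len=134217728 repl=3 [node1, node5, node8]
    #   ...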
Let me ask you a question; this is an excellent interview question. When you have HDFS, what happens to the local filesystem, the ext4 on each node?

HDFS is by no means a replacement for your local filesystem. The operating system still relies on the local filesystem; in fact, the operating system does not even care about the presence of HDFS. One more interesting thing: HDFS still has to go through ext4 to save its blocks to storage.
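You can verify this on any DataNode. Each HDFS block ends up as an ordinary blk_* file on the node's local ext4 filesystem, under the directory configured by dfs.datanode.data.dir. The path below is a typical example and will differ on your cluster:

    # On a DataNode: HDFS blocks are plain files on the local filesystem
    # (the data directory path here is illustrative)
    ls /hadoop/hdfs/data/current/BP-*/current/finalized/subdir0/subdir0/
    # blk_1073741845  blk_1073741845_1021.meta  blk_1073741846  ...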
The true power of HDFS is that it is spread across all the nodes in your cluster and has a distributed view of the cluster, which is how it knows to reconstruct the 700 MB dataset from its blocks. ext4, on the other hand, has no distributed view, only a local one: it only knows about the blocks on the storage it manages.
O.K. That explains the need for a filesystem like HDFS in a distributed environment like Hadoop.
Let’s summarize the benefits and functionalities of HDFS.
- First of all, HDFS supports the concept of blocks. When you upload a file to HDFS, the file is divided into fixed-size blocks to support distributed computation, which is key for Hadoop, and HDFS keeps track of all the blocks in the cluster.
- Disk failures and data corruption are inevitable in a big data environment. HDFS maintains data integrity and helps recover from data loss by replicating each block on more than one node.
- HDFS supports scaling, that is, if you would like to expand your cluster by adding more nodes, it is very easy to do with HDFS.
- You don't need any specialized hardware to operate HDFS. HDFS was built from the ground up to work with commodity computers.
So let's summarize what we learnt in this post. First we looked at what a filesystem is, its functions, and a few of the major filesystems available today. Next we talked about the need for a new distributed filesystem like HDFS and compared a local filesystem like ext4 with HDFS. Finally, we saw the benefits of HDFS.