HDFS Architecture
Hadoop Starter Kit – Tutorial
With the Working with HDFS post, we now know how to work with HDFS. It is now time to look at the important components and processes that make HDFS function properly. In other words, let's learn about the architecture of HDFS.
You can get access to our free Hadoop cluster to try the commands in this post. Before you read on, please note that this post is from our free introductory course, Hadoop Starter Kit. Click here to enroll in Hadoop Starter Kit. You will also get free access to our 3-node Hadoop cluster hosted on Amazon Web Services (AWS) – also free!
Look at the output of the file system check (fsck) command below. It gives us the status of a file – whether it is healthy or not. It also tells us the number of blocks the file is made of, and it not only lists the blocks but also the block locations, that is, the nodes where the blocks are physically stored.
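Here is an abridged, illustrative fsck run (the file path is a hypothetical example, the block-pool id is shortened to BP-xxxx, and exact formatting varies by Hadoop version):

$ hdfs fsck /user/hirw/input/sample.txt -files -blocks -locations
/user/hirw/input/sample.txt 193000000 bytes, 2 block(s):  OK
0. BP-xxxx:blk_1073741848_1024 len=134217728 repl=3 [192.168.56.101:50010, 192.168.56.102:50010, 192.168.56.103:50010]
1. BP-xxxx:blk_1073741849_1025 len=58782272 repl=3 [...]

Status: HEALTHY

Note how each block line shows the block id, its length, its replication factor and the datanodes holding a copy of it.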
Now imagine we are running a 100-node Hadoop cluster and we have thousands of big datasets in the cluster. All our datasets are divided into blocks, and the blocks are spread across the 100 nodes. When we add replication to the equation, each block is replicated 3 times. Managing all this information becomes very complex pretty quickly. We know HDFS has a global, distributed view of the file system and knows how to construct a file from its blocks. But the question is: how does HDFS accomplish that?
Time for some terminology. The nodes where the blocks are physically stored are known as the datanodes, named so because these nodes hold the actual data of the cluster. Each datanode knows the list of blocks it is responsible for. For example, going back to the output of the fsck command above, datanode 192.168.56.101 knows that it is responsible for blk_1073741848_1024.
But the datanode is missing a key piece of information: it does not know that blockXXX and blockYYY belong to fileNNN. Also, any given datanode knows only about the blocks it is responsible for and does not know about blocks on other datanodes. That is a problem for us as users, because we don't know anything about the blocks and, quite honestly, we don't care about the blocks. All we know is the file name, and we should be able to work with just the file name.
So, if the datanodes do not know which block belongs to which file, then who has that key information? It is maintained by a node called the namenode. The namenode keeps track of all the files or datasets in HDFS. For any given file in HDFS, the namenode knows not only the list of blocks that make up the file but also the location of those blocks.
Now we can understand the importance of the namenode. Imagine the namenode being down in our Hadoop cluster: there would be no way to look up files in the cluster, because we would not be able to figure out the list of blocks that make up each file, nor the blocks' locations.
Other than the block locations, the namenode also holds the metadata of the files and folders in HDFS, like the owner, size, replication factor, and creation and modification times. Due to the significance of the namenode, it is called the master node, and the datanodes are called the slave nodes.
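You can see some of this metadata from the command line with the stat command. A minimal sketch (the path is a hypothetical example): %u prints the owner, %r the replication factor, %b the size in bytes and %y the modification time:

$ hdfs dfs -stat "user=%u repl=%r size=%b modified=%y" /user/hirw/input/sample.txt
user=hirw repl=3 size=193000000 modified=2017-02-16 10:15:30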
The namenode persists all the metadata about files and folders to disk, except for the block locations. So, here is a very good interview question.
Given that block locations are vital for a functioning HDFS, why does the namenode not persist that information? Because the datanodes have it. Remember, each datanode knows the list of blocks it is responsible for; what it does not know is that block X belongs to file ABC. The datanodes and the namenode are in constant communication with each other, so when the namenode starts up, the datanodes connect to it and report the list of blocks each of them is responsible for.
The namenode holds the block locations in memory and never persists that information. In a busy cluster, HDFS is constantly changing as new data files come in; if the namenode had to persist every change to the blocks by writing the information to disk, it would become a bottleneck. So, for performance reasons, the namenode keeps the block locations in memory. For this reason, we can imagine that the namenode is a powerful node in the cluster in terms of memory capacity.
Now we know there are two types of nodes in the cluster – the namenode and the datanodes – and we understand the importance of the namenode. Namenode failure is clearly not an option, but failures in computing environments are inevitable and we need to be prepared for them.
So when a namenode fails, there are recovery strategies, including a Secondary Namenode, an Active/Standby setup, and HDFS Federation.
How do the datanodes know the location of the namenode? Hadoop has a set of configuration files; in our cluster the configuration files are under the /etc/hadoop/conf directory.
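For instance, listing that directory should show the familiar configuration files (exact contents vary by distribution and installed services; this listing is illustrative):

$ ls /etc/hadoop/conf
core-site.xml  hdfs-site.xml  log4j.properties  masters  slaves  ...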
First let's look at core-site.xml. This file has a very important property, fs.defaultFS, in which we specify the location of the namenode. So the namenode for this cluster runs on ip-172-31-xxx-yyy.ec2.internal.
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-172-31-xxx-yyy.ec2.internal:8020</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>hadoop.proxyuser.mapred.groups</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.mapred.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.oozie.groups</name>
    <value>*</value>
  </property>
</configuration>
There are a few other properties in this file for different purposes, and they may not be HDFS specific. This file is made available to all the nodes in the cluster, so any node can look up this property and know the location of the namenode.
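As a quick sketch, you can query this property from any node with the getconf tool; the output shown is what we would expect on this cluster:

$ hdfs getconf -confKey fs.defaultFS
hdfs://ip-172-31-xxx-yyy.ec2.internal:8020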
The next important property file is hdfs-site.xml. This file lists the properties that are specific to HDFS.
dfs.namenode.name.dir specifies the location in the local filesystem where the namenode stores its files.
dfs.datanode.data.dir specifies the location in the local filesystem where the datanode stores its files and blocks.
The next property is the HTTP address of the namenode. We can reach the namenode at this URL and get details about HDFS.
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.permissions.superusergroup</name>
    <value>hadoop</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/1/dfs/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/1/dfs/dn</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>
    <value>ec2-54-zzz-abc-xyz.compute-1.amazonaws.com:50070</value>
    <description>
      The address and the base port on which the dfs NameNode Web UI will listen.
    </description>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>false</value>
  </property>
</configuration>
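Besides the browser UI, the namenode's HTTP port also serves machine-readable status through the standard /jmx servlet. A hedged sketch (the exact metrics returned vary by Hadoop version):

$ curl 'http://ec2-54-zzz-abc-xyz.compute-1.amazonaws.com:50070/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo'

This returns JSON with details such as total and used capacity and the list of live datanodes.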
There is another important property which we don't see here: dfs.replication, which specifies the replication factor of the cluster, that is, how many times a block will be replicated in the cluster. The default value for this property is 3. When we don't see a property in the configuration files, Hadoop substitutes the default value for that property.
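We can see this in action with a small sketch (the file path in the second command is a hypothetical example). getconf returns the effective value, default or not, and setrep changes the replication factor of an existing file:

$ hdfs getconf -confKey dfs.replication
3
$ hdfs dfs -setrep -w 2 /user/hirw/input/sample.txt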
Next, we have two more important files – masters and slaves.
The masters file specifies the node where the secondary namenode will be started, and the slaves file can optionally contain the list of all the datanodes.
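For illustration, on a small cluster these are plain text files with one hostname per line (the hostnames here are hypothetical):

$ cat /etc/hadoop/conf/masters
ip-172-31-xxx-yyy.ec2.internal
$ cat /etc/hadoop/conf/slaves
ip-172-31-aaa-bbb.ec2.internal
ip-172-31-ccc-ddd.ec2.internal
ip-172-31-eee-fff.ec2.internal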
Now we know the configuration details of the cluster; let's look at the hardware configuration of the nodes. The hardware configuration varies from environment to environment and is highly dependent on the usage of the cluster. CPU, RAM and hard disk are the 3 primary elements that make up a node. The numbers listed here are just examples and will vary from cluster to cluster.
In this example, both the namenode and the datanodes have 2 quad-core processors each. The important difference between the namenode and a datanode is in memory and hard disk: the namenode is usually high on memory when compared to a datanode, because it holds a lot of data, like the block locations, in memory, whereas a datanode is high on hard disk space because it has to store the actual datasets. Finally, all the nodes must be connected by a high-speed network for efficient processing; in this example we have listed 10 Gigabit Ethernet.
Now that we have an idea about the hardware configuration, let's familiarize ourselves with some jargon.
A rack is a group of nodes connected in a network, and a cluster is a group of racks connected in a network; thereby all the nodes in the cluster are connected. Datacenters are the physical locations where a cluster is housed. Some years back, datacenters used to be dark and ugly, but these days datacenters are state of the art.
The picture below is the inside of Facebook's datacenter in Prineville, Oregon. Beautiful, isn't it?
We learnt a lot in this post. Let's summarize: we talked about the functionalities of the namenode and the datanodes, we looked at HDFS-specific configuration properties, and we also covered the hardware configuration of both the namenode and the datanodes. Finally, we saw that a cluster is made up of racks, and racks are made up of individual nodes.