NameNode and DataNode
July 14, 2015
In this post, let's talk about two important types of nodes in your Hadoop cluster and their functions: the NameNode and the DataNode.
What is HDFS?
We covered a great deal of information about HDFS in the “HDFS – Why Another Filesystem?” chapter of the Hadoop Starter Kit course. If you are new to Hadoop, we suggest taking the free course.
NameNode
- NameNode is the centerpiece of HDFS.
- NameNode is also known as the Master.
- NameNode only stores the metadata of HDFS – the directory tree of all files in the file system, and tracks the files across the cluster.
- NameNode does not store the actual data or the dataset. The data itself is actually stored in the DataNodes.
- NameNode knows the list of blocks and their locations for any given file in HDFS. With this information, NameNode knows how to construct the file from its blocks (see the short sketch after this list).
- NameNode is critical to HDFS: when the NameNode is down, the HDFS/Hadoop cluster is inaccessible and considered down.
- NameNode is a single point of failure in Hadoop cluster.
- NameNode is usually configured with a lot of memory (RAM), because the block locations are held in main memory.
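To make the block-to-location mapping concrete, here is a minimal sketch using the Hadoop FileSystem Java API. It assumes a Hadoop client classpath and a core-site.xml/hdfs-site.xml whose fs.defaultFS points at your NameNode; the path /user/hadoop/sample.txt is a hypothetical file used only for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath;
        // assumes fs.defaultFS points at your NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical file path, used only for illustration.
        Path file = new Path("/user/hadoop/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // This block-to-DataNode mapping is answered from the NameNode's
        // in-memory metadata; no file data is read here.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}
```

Each line of output corresponds to one block of the file and lists the DataNodes holding a replica of that block.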
DataNode
- DataNode is responsible for storing the actual data in HDFS.
- DataNode is also known as the Slave.
- NameNode and DataNode are in constant communication.
- When a DataNode starts up, it announces itself to the NameNode along with the list of blocks it is responsible for (the sketch after this list shows how to view, for each DataNode, what the NameNode has learned from these reports).
- When a DataNode is down, it does not affect the availability of data or the cluster. The NameNode will arrange re-replication of the blocks that were managed by the unavailable DataNode.
- DataNode is usually configured with a lot of hard disk space, because the actual data is stored in the DataNode.
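For the DataNode side, here is a similar minimal sketch (same assumptions as the previous one) that asks the NameNode for its current view of the DataNodes. What the NameNode knows about each DataNode comes from the heartbeats and block reports described above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class ListDataNodes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // getDataNodeStats() is HDFS-specific, so this cast assumes
        // fs.defaultFS points at an HDFS cluster, not a local filesystem.
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Each entry reflects the NameNode's view of one DataNode,
        // built from that DataNode's heartbeats and block reports.
        for (DatanodeInfo node : dfs.getDataNodeStats()) {
            System.out.println(node.getHostName()
                    + " capacity=" + node.getCapacity()
                    + " used=" + node.getDfsUsed()
                    + " remaining=" + node.getRemaining());
        }
        dfs.close();
    }
}
```

A DataNode that stops sending heartbeats eventually drops out of this live list, which is when the NameNode begins re-replicating the blocks it held.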
Hardware Configuration
Hardware configuration of nodes varies from cluster to cluster, and it depends on the usage of the cluster. In some Hadoop clusters the velocity of data growth is high; in that case, more importance is given to storage capacity. If the SLAs for job executions are important and cannot be missed, then more importance is given to the processing power of the nodes.
Often the term “commodity computers” is misunderstood. Commodity computers or nodes do not mean cheap or less powerful hardware; it just means inexpensive computers, de-emphasizing the need for specialized hardware.
Here is a sample hardware configuration for a NameNode and a DataNode.
NameNode Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 128 GB
Disk: 6 x 1TB SATA
Network: 10 Gigabit Ethernet
DataNode Configuration
Processors: 2 Quad Core CPUs running @ 2 GHz
RAM: 64 GB
Disk: 12-24 x 1TB SATA
Network: 10 Gigabit Ethernet
Like what you are reading? Enroll in our free Hadoop Starter Kit course & explore Hadoop in depth.