There are 0 datanode(s) running and no node(s)
January 12, 2017
You are trying to write a file to HDFS and this is what you see in the client output. The error says that /user/ubuntu/test-dataset cannot be replicated to any node in the cluster. This error usually means no datanodes are connected to the namenode, and without datanodes HDFS is not functional.
14/01/02 04:22:53 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/01/02 04:22:56 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/ubuntu/test-dataset could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1384)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2477)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:555)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:387)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:59582)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2048)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2044)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2042)
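A quick way to confirm that no datanodes have registered with the namenode is hdfs dfsadmin -report. With the cluster in this state the live datanode count shows as 0 (output trimmed here, and the exact wording varies by Hadoop version):

hdfs dfsadmin -report
...
Live datanodes (0):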
If you dig through the datanode logs, you will see errors when the datanode attempted to connect to the namenode. In the log below, the datanode is trying to connect to the namenode at 192.168.10.12:9000.
2014-01-13 12:41:02,332 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2014-01-13 12:41:02,334 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
2014-01-13 12:41:03,427 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.10.12:9000. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-01-13 12:41:04,427 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.10.12:9000. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
2014-01-13 12:41:05,428 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: namenode/192.168.10.12:9000. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
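Before digging into configuration, you can rule out basic network problems by probing the namenode port from one of the datanodes. A simple check with netcat (assuming nc is installed on the datanode) looks like this:

nc -zv 192.168.10.12 9000

If the connection is refused or times out even though the namenode process is up, the namenode is most likely not listening on that address, which is exactly what we verify in the next two sections.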
Check configuration
Here is the core-site.xml configuration file for this setup. fs.default.name points to where the namenode is running. In this setup, the namenode runs at 192.168.10.12 and listens on port 9000. So from the datanode logs above, the datanode is attempting to connect to the correct address, which means the problem is on the namenode side.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.10.12:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/hadoop/tmp</value>
  </property>
</configuration>
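If you are not sure which value the daemons actually picked up (for example, when more than one configuration directory is in play), hdfs getconf prints the effective setting. Note that fs.default.name is the deprecated alias of fs.defaultFS in Hadoop 2.x, so either key should report the same value; treat the exact output below as illustrative:

hdfs getconf -confKey fs.defaultFS
hdfs://192.168.10.12:9000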
Check running process & port
We know from the configuration above that the namenode should be listening on port 9000. Now, on the namenode, let's check whether any process is listening on port 9000. Here is the output of netstat (you will see a lot of lines; look closely for port 9000).
Below we see a java process running, but it is listening on 127.0.0.1:9000 and not on 192.168.10.12:9000. That is our issue. Since the namenode is not listening on 192.168.10.12:9000, the datanodes' connection attempts time out.
Run sudo netstat -ntlp on the master; it shows:
tcp6 0 0 127.0.0.1:9000 :::* LISTEN 32646/java
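On newer distributions where netstat is not installed by default, ss from iproute2 gives equivalent information (the flags are analogous to netstat's):

sudo ss -ntlp | grep 9000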
Check how NameNode is resolved
The /etc/hosts file on your system maps IP addresses to host names, and here is what we have in the file currently:
192.168.10.12 localhost
192.168.10.12 namenode
127.0.0.1 localhost
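You can see how this plays out by asking the resolver what 192.168.10.12 maps back to. On a typical glibc-based Linux system, getent consults /etc/hosts first; with the file above, the reverse lookup returns localhost (shown as an illustrative check, the output format may differ slightly):

getent hosts 192.168.10.12
192.168.10.12   localhost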
Since we are using the IP address in the configuration file, 192.168.10.12 is first translated to the host name localhost (the first matching entry in the file takes precedence), and localhost in turn resolves to 127.0.0.1. This causes the namenode to start and listen on 127.0.0.1:9000, which the datanodes cannot reach from other machines. An easy fix is to modify the /etc/hosts file as shown below, so that 192.168.10.12 resolves to namenode instead of localhost.
192.168.10.12 namenode
192.168.10.12 localhost
127.0.0.1 localhost
After the above change to the /etc/hosts file, restart the namenode; here is the result of netstat after the restart. Now we can see the namenode process is listening on 192.168.10.12:9000.
sudo netstat -ntlp

tcp6 0 0 192.168.10.12:9000 :::* LISTEN 32646/java
Once you confirm that the namenode is listening on 192.168.10.12:9000, restart all your datanodes. This time the datanodes should be able to connect to the namenode with no issues.
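On a plain Apache Hadoop 2.x tarball install, restarting a datanode can be done with the bundled daemon script; script names and paths vary between distributions such as CDH or HDP, so treat this as a sketch and adjust for your environment:

$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

Once the datanodes are back up, hdfs dfsadmin -report should show a non-zero live datanode count and the original write should succeed.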