JobTracker and TaskTracker
JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are now deprecated in […]
NameNode and DataNode
In this post, let's talk about the two important types of nodes in your Hadoop cluster, NameNode and DataNode, and their functions. […]
How to change the default replication factor?
What is the replication factor? The replication factor dictates how many copies of a block should be kept in your cluster. […]
One of my close friends recently joined Microsoft in Seattle, on its highly acclaimed data analysis team. I asked him what his first assignment was. He […]
This post explains how you can approach the above question when it is asked in an interview. This is an open-ended interview question, and the interviewer […]
This post explains the class relationships involved when we use ToolRunner to run a MapReduce job. It is not really complicated, but we use the below pictorial […]
This post explains how to unit test a MapReduce program using MRUnit. Apache MRUnit™ is a Java library that helps developers unit test Apache Hadoop map […]
What is the Million Song Dataset? The Million Song Dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. The […]
Executions in Hadoop use the username of the logged-in user to determine permissions in the cluster. When running jobs or working with HDFS, the user […]
This post explains how to write an Oozie MapReduce action with multiple inputs, and how each input is configured to use a different InputFormat and Mapper. Let's […]
This post explains the fix for the error below, which can appear when starting the DataNode. 2013-12-14 23:39:09,354 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-hadoop-user/dfs/data: namenode namespaceID […]