Data Locality in Hadoop

Data Locality in Hadoop refers to the “proximity” of the data with respect to the Mapper tasks that work on it.

Why is Data Locality important?

When a dataset is stored in HDFS, it is divided into blocks and stored across the DataNodes in the Hadoop cluster. When a MapReduce job is executed against the dataset, the individual Mappers process the blocks (Input Splits). When the data is not available to a Mapper on the node where it is executing, it has to be copied over the network from the DataNode that holds it to the DataNode running the Mapper task.
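
To make this concrete, here is a minimal sketch (the input path is hypothetical) that asks the NameNode, through the standard Hadoop FileSystem API, which DataNodes hold the blocks of a given file:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical input file; replace with a path in your cluster
    Path path = new Path("/user/hadoop/input/dataset.txt");
    FileStatus status = fs.getFileStatus(path);

    // One BlockLocation per block, listing the DataNodes that hold a replica
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + String.join(",", block.getHosts()));
    }
  }
}
```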

Imagine a MapReduce job with over 100 Mappers, each trying to copy data from another DataNode in the cluster at the same time; this would result in serious network congestion, which is not ideal. So it is always cheaper and more effective to move the computation closer to the data than to move the data closer to the computation.

How is data proximity defined?

When the JobTracker (MRv1) or ApplicationMaster (MRv2) receives a request to run a job, it looks at which nodes in the cluster have sufficient resources to execute the Mappers and Reducers for the job. At this point, serious consideration is given to deciding on which nodes the individual Mappers will be executed, based on where the data for each Mapper is located.
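
One way to see the location information the scheduler works with is to list a job's Input Splits along with their preferred hosts. A small sketch using the new MapReduce API (the input directory is hypothetical):

```java
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    // Hypothetical input directory
    FileInputFormat.addInputPath(job, new Path("/user/hadoop/input"));

    // Each InputSplit carries the hosts that hold its data; these are the
    // hints the scheduler uses when placing the corresponding Mapper
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    for (InputSplit split : splits) {
      System.out.println(split + " -> preferred hosts: "
          + String.join(",", split.getLocations()));
    }
  }
}
```

The hosts returned for each split form the preference list: the scheduler first tries those nodes, then falls back to the same rack, and finally to a different rack, as described below.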

[Figure: Data Locality In Hadoop]

Data Local

When the data is located on the same node as the Mapper working on it, it is referred to as Data Local. In this case, the data is as close to the computation as possible. When scheduling a Mapper, the JobTracker (MRv1) or ApplicationMaster (MRv2) prefers the node that holds the data the Mapper needs.

Rack Local

Although Data Local is the ideal choice, it is not always possible to execute the Mapper on the same node as the data due to resource constraints on a busy cluster. In such instances it is preferred to run the Mapper on a different node but on the same rack as the node that has the data. In this case, the data is moved within the rack, from the node that holds it to the node executing the Mapper.

Different Rack

On a busy cluster, sometimes even Rack Local is not possible. In that case, a node on a different rack is chosen to execute the Mapper, and the data is copied across racks from the node that holds it to the node executing the Mapper. This is the least preferred scenario.
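
After a job completes, Hadoop's built-in job counters report how many Mappers ran at each of these locality levels, which makes it easy to verify how well the scheduler did. A short sketch (LocalityReport is a hypothetical helper; it assumes a handle to a completed Job):

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
  // Hypothetical helper; 'job' is assumed to be a completed MapReduce job
  public static void print(Job job) throws Exception {
    Counters counters = job.getCounters();

    // Mappers that ran on the node holding their data (Data Local)
    long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
    // Mappers that ran on a different node within the same rack (Rack Local)
    long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
    // Mappers that had to pull their data across racks (Different Rack)
    long offRack = counters.findCounter(JobCounter.OTHER_LOCAL_MAPS).getValue();

    System.out.printf("data-local=%d rack-local=%d off-rack=%d%n",
        dataLocal, rackLocal, offRack);
  }
}
```

A healthy job on a quiet cluster should show mostly data-local maps; a large off-rack count is usually a sign of a busy or unbalanced cluster.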

 
