Changing Number Of Mappers

The number of mappers always equals the number of input splits. That said, it is possible to control the number of splits by changing mapred.min.split.size (named mapreduce.input.fileinputformat.split.minsize in newer Hadoop releases), which sets the minimum input split size.

Assume the block size is 64 MB and mapred.min.split.size is set to 128 MB. Hadoop uses the formula below to calculate the split size, which works out to 128 MB (let's assume the maximum split size is 256 MB).

Input Split Size = max(minimumSize, min(maximumSize, blockSize))

Let’s now substitute the values:

max(128 MB, min(256 MB, 64 MB)) = max(128 MB, 64 MB) = 128 MB

The size of InputSplit will be 128 MB even though the block size is 64 MB.
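The substitution above can be checked with a few lines of Python. This is only a sketch of the formula as written in this post, not Hadoop's own code; the function name compute_split_size and the MB constant are made up for illustration.

```python
MB = 1024 * 1024  # bytes per megabyte (illustrative constant)

def compute_split_size(min_size, max_size, block_size):
    """Apply the formula from the post:
    max(minimumSize, min(maximumSize, blockSize))."""
    return max(min_size, min(max_size, block_size))

# Values from the example: min 128 MB, max 256 MB, block 64 MB
split_size = compute_split_size(128 * MB, 256 * MB, 64 * MB)
print(split_size // MB, "MB")
```

With the minimum left below the block size (say 1 byte), the same function returns the block size, which is the usual default behaviour.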

Is There a Benefit In Doing This?

There is no real benefit in forcing the split size to be greater than the block size. Doing so decreases the number of mappers, but at the expense of data locality: an InputSplit will now comprise data from at least two blocks, and both blocks may not be available on the same DataNode.
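To put rough numbers on this trade-off, here is a small sketch for a hypothetical 1 GB input file. The file size and the helper names estimate_mappers and blocks_per_split are assumptions for illustration. The estimate is also simplified: it divides the file size by the split size and ignores the small slack Hadoop allows on the last split.

```python
import math

MB = 1024 * 1024  # bytes per megabyte (illustrative constant)

def estimate_mappers(file_size, split_size):
    """Rough mapper count: one mapper per split (simplified estimate)."""
    return math.ceil(file_size / split_size)

def blocks_per_split(split_size, block_size):
    """How many HDFS blocks a single split has to read from."""
    return math.ceil(split_size / block_size)

file_size = 1024 * MB   # hypothetical 1 GB input file
block_size = 64 * MB

for split_size in (64 * MB, 128 * MB):
    print(
        f"split={split_size // MB} MB: "
        f"mappers={estimate_mappers(file_size, split_size)}, "
        f"blocks per split={blocks_per_split(split_size, block_size)}"
    )
```

Doubling the split size halves the mapper count in this estimate, but each mapper then reads from two blocks, and only one of them is guaranteed to be local to the node running the task.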

Big Data In Real World

We are a group of Big Data engineers who are passionate about Big Data and related technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real-time streaming platforms. We have seen a wide range of real-world Big Data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.
