August 11, 2015
Changing Number Of Mappers
The number of mappers always equals the number of input splits. That means you can control the number of mappers by controlling the number of splits, which you do by changing mapred.min.split.size, the property that sets the minimum input split size (in Hadoop 2 and later it is named mapreduce.input.fileinputformat.split.minsize).
Assume the block size is 64 MB and mapred.min.split.size is set to 128 MB. Hadoop uses the formula below to calculate the split size, and with these values it comes to 128 MB (let's assume the maximum split size is 256 MB).
Input Split Size = max(minimumSize, min(maximumSize, blockSize))
Let's now substitute the values:
max(128 MB, min(256 MB, 64 MB)) = max(128 MB, 64 MB) = 128 MB
The size of InputSplit will be 128 MB even though the block size is 64 MB.
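The calculation above can be checked with a few lines of Python. This is only a sketch of the formula as stated in this post, not Hadoop's own code; the function name compute_split_size and the MB constant are made up for illustration.

```python
MB = 1024 * 1024  # bytes in one megabyte (illustrative constant)


def compute_split_size(min_size, max_size, block_size):
    """Input Split Size = max(minimumSize, min(maximumSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


# Values from the example: 64 MB blocks, 128 MB minimum, 256 MB maximum.
forced = compute_split_size(128 * MB, 256 * MB, 64 * MB)
print(forced // MB)  # 128

# With a minimum smaller than the block size, the block size wins.
unforced = compute_split_size(1 * MB, 256 * MB, 64 * MB)
print(unforced // MB)  # 64
```

The second call shows why the minimum has to be raised above the block size before it has any effect on the split size.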
Is There a Benefit In Doing This?
There is no real benefit in forcing the split size to be greater than the block size. Doing so decreases the number of mappers, but at the expense of data locality: an InputSplit now comprises data from at least two blocks, and those blocks may not be available on the same DataNode, so part of the split may have to be read over the network.
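To make the trade-off concrete, here is a rough Python estimate for a hypothetical 1 GB file. The helper names and the file size are assumptions for illustration, and the mapper count is a plain ceiling division, which ignores details of how Hadoop handles the final, possibly shorter, split.

```python
MB = 1024 * 1024  # bytes in one megabyte (illustrative constant)


def ceil_div(a, b):
    """Integer ceiling division, avoiding floating point."""
    return -(-a // b)


def estimate_mappers(file_size, split_size):
    """Approximate mapper count: one mapper per input split."""
    return ceil_div(file_size, split_size)


def blocks_per_split(split_size, block_size):
    """Minimum number of blocks a single split has to cover."""
    return ceil_div(split_size, block_size)


file_size = 1024 * MB   # hypothetical 1 GB input file
block_size = 64 * MB    # block size from the example above

for split_size in (64 * MB, 128 * MB):
    print(
        split_size // MB, "MB split:",
        estimate_mappers(file_size, split_size), "mappers,",
        blocks_per_split(split_size, block_size), "block(s) per split",
    )
```

Doubling the split size halves the estimated number of mappers, but every split then covers two blocks, and nothing guarantees that both are stored on the node where the mapper runs.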