How to change default block size in HDFS?
July 3, 2015
In this post we are going to see how to upload a file to HDFS while overriding the default block size. In older versions of Hadoop the default block size was 64 MB, and in newer versions it is 128 MB.
Let’s assume that the default block size in your cluster is 128 MB. What if you want all the datasets in your HDFS to use that default of 128 MB, but you would like one specific dataset to be stored with a block size of 256 MB?
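If you are not sure what the default block size on your cluster actually is, a quick way to check it from the command line is shown below (this is just a sketch; the value printed depends on your cluster's hdfs-site.xml):

# Print the configured default block size, in bytes (134217728 = 128 MB)
hdfs getconf -confKey dfs.blocksize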
Refer to the “HDFS – Why Another Filesystem” chapter in the FREE Hadoop Starter Kit course to learn more about the block size in other filesytems.
Why would you want to increase the block size of a specific dataset from 128 MB to 256 MB?
To answer this question, you need to understand the benefit of having a larger block size. A single HDFS block (64 MB, 128 MB or more) is written to disk sequentially. When you write data sequentially, there is a fair chance it will land in contiguous space on disk, meaning the data is written next to each other in a continuous fashion. When data is laid out on disk contiguously, it reduces the number of disk seeks during the read operation, resulting in a more efficient read. That is why the block size in HDFS is large compared to other file systems.
Now the question becomes: should I give my dataset a block size of 128 MB, 256 MB or even more? It all depends on your cluster capacity and the size of your datasets. Let's say you have a dataset which is 2 petabytes in size. A 64 MB block size for this dataset will result in 31 million+ blocks, which puts stress on the NameNode to manage all those blocks. Having a lot of blocks will also result in a lot of mappers during MapReduce execution. So in this case you may decide to increase the block size just for that dataset.
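To see where the 31 million figure comes from, here is a quick back-of-the-envelope calculation. The 2 PB dataset size is just the example from above, expressed in decimal megabytes:

# 2 PB is roughly 2,000,000,000 MB (decimal)
DATASET_MB=2000000000

# Approximate block counts at different block sizes
echo $((DATASET_MB / 64))     # 31,250,000 blocks at 64 MB
echo $((DATASET_MB / 128))    # 15,625,000 blocks at 128 MB
echo $((DATASET_MB / 256))    # 7,812,500 blocks at 256 MB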
Try it out
Try the commands in our cluster. Click here to get FREE access to the cluster.
Changing the block size when you upload a file in HDFS is very simple.
# Create the directory if it does not exist
hadoop fs -mkdir blksize

# Copy a file to HDFS with the default block size (128 MB)
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv blksize

# Override the default block size with 256 MB (268435456 bytes)
hadoop fs -D dfs.blocksize=268435456 -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv blksize/dwp-payments-april10_256MB.csv
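Before jumping to the UI, you can also check the block size from the command line. This is an optional verification step, assuming the files were uploaded into the blksize directory as shown above:

# Print the block size (in bytes) recorded for each uploaded file
hadoop fs -stat "name: %n  block size: %o" blksize/dwp-payments-april10.csv
hadoop fs -stat "name: %n  block size: %o" blksize/dwp-payments-april10_256MB.csv

# Or inspect the individual blocks with fsck (use the full HDFS path)
hdfs fsck /user/<username>/blksize/dwp-payments-april10_256MB.csv -files -blocks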
Verify in UI
Make sure to update <username> with your username and <dir> with your directory name in the URL below.
http://ec2-54-175-1-192.compute-1.amazonaws.com:50070/explorer.html#/user/<username>/<dir>
For example:
http://ec2-54-175-1-192.compute-1.amazonaws.com:50070/explorer.html#/user/hirwuser150430/blksize