How to change default replication factor?
July 12, 2015
What Is Replication Factor?
The replication factor dictates how many copies of each block are kept in your cluster. By default, the replication factor is 3, so any file you create in HDFS will have a replication factor of 3 and each block of the file will be copied to 3 different nodes in your cluster.
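You can confirm the default replication factor configured for your cluster with hdfs getconf. The value shown below is just the stock default; your cluster may be configured differently in hdfs-site.xml.

# Print the cluster-wide default replication factor (dfs.replication)
hdfs getconf -confKey dfs.replication
# Typically prints 3 unless overridden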
Change Replication Factor – Why?
Let’s say you have a 1 TB dataset and the default replication factor in your cluster is 3. That means each block of the dataset will be replicated 3 times, so the dataset consumes roughly 3 TB of raw storage. Now suppose this 1 TB dataset is not that critical for you, meaning that if it is corrupted or lost it would not cause a business impact. In that case you can set the replication factor on just this dataset to 1, leaving the other files and datasets in HDFS untouched.
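As a quick sketch of that scenario (the path below is just a placeholder for your non-critical dataset), the change is a single command:

# Reduce replication to 1 on a non-critical dataset, recursively
# (/data/non-critical-dataset is a hypothetical path)
hadoop fs -setrep -R 1 /data/non-critical-dataset
# Raw storage for this dataset drops from ~3 TB to ~1 TB once the extra replicas are removed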
Let's Try It
Try the commands in our cluster. Click to get FREE access to the cluster.
Use the -setrep command to change the replication factor of files that already exist in HDFS. The -R flag recursively changes the replication factor of all files under the specified folder.
# Create a folder if it doesn't exist
hadoop fs -mkdir replication-test

# Upload a file into the folder
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv replication-test

# Check replication
hadoop fs -ls replication-test/dwp-payments-april10.csv

# Change replication to 2
hadoop fs -setrep -R -w 2 replication-test/dwp-payments-april10.csv
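In the -ls output, the second column is the replication factor: it should read 3 before the -setrep command and 2 afterwards. The owner, size and timestamp below are only illustrative, not actual cluster output.

# Before setrep: replication factor 3 (second column)
-rw-r--r--   3 hirw hirw   67204420 2015-07-12 10:15 replication-test/dwp-payments-april10.csv
# After setrep: replication factor 2
-rw-r--r--   2 hirw hirw   67204420 2015-07-12 10:15 replication-test/dwp-payments-april10.csv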
Note that if you try to set the replication factor to a number higher than the number of DataNodes in your cluster, the blocks will stay under-replicated and, because of the -w flag, the command will keep waiting until the desired number of replicas can be created. In other words, it will wait until you add additional nodes to the cluster.
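If you suspect blocks are stuck in an under-replicated state, one way to check is with hdfs fsck (the exact wording of the report varies by Hadoop version):

# Report block health for the file, including any under-replicated blocks
hdfs fsck replication-test/dwp-payments-april10.csv -files -blocks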
You can also use the dfs.replication property to specify the replication factor when you upload files into HDFS, but this only applies to newly created files and does not change the replication factor of existing files.
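For example, to override dfs.replication for a single upload (a sketch, reusing the same sample file from above), pass the property on the command line; files already in HDFS keep their current replication factor and need -setrep to change it.

# Upload with a replication factor of 2 for this file only
hadoop fs -D dfs.replication=2 -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv replication-test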