What is DistCp? - Big Data In Real World

What is DistCp?


The “Working With HDFS” chapter in our Hadoop Starter Kit course covers the details of working with HDFS. In that chapter we looked at how to copy, move, and delete files in HDFS. In this post, let’s talk about a command that copies large volumes of files or datasets in a distributed fashion, either within your cluster or between Hadoop clusters.

What is DistCp?

DistCp (distributed copy) is a tool for large inter-cluster and intra-cluster copies. It runs the copy as a MapReduce job, so the work is spread across multiple nodes in your cluster. For large datasets this makes it far more efficient than a hadoop fs -cp operation, which copies everything through a single client process.

How does it work?

DistCp expands the source files and directories into a list and distributes the work among multiple map tasks. Each map task copies one partition of the files in the source list.
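To make the idea of partitioning concrete, here is a toy sketch in plain shell. It deals a list of files out to a fixed number of map tasks in round-robin order. The function name assign_to_maps and the file paths are made up for illustration, and round-robin is an assumption chosen for simplicity; it is not how DistCp itself decides which task copies which file.

```shell
# Toy model: deal a list of source files out to N map tasks, round-robin.
# This is NOT DistCp's real assignment logic; it only illustrates that
# each map task ends up with its own slice of the source list.
assign_to_maps() {
    num_maps=$1   # how many map tasks to spread the files over
    shift         # the remaining arguments are the files to copy
    i=0
    for file in "$@"; do
        echo "map-$((i % num_maps)) $file"
        i=$((i + 1))
    done
}

# Five hypothetical files spread over two map tasks.
assign_to_maps 2 /data/a.csv /data/b.csv /data/c.csv /data/d.csv /data/e.csv
```

In the real tool, the -m option sets the maximum number of map tasks used for the copy.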

Syntax

Here is the syntax. You have to specify fully qualified paths, including the namenode host and port. The source and destination namenode and port will be the same if you are copying files within a cluster, and different if you are copying between two clusters. When copying between two clusters, both HDFS installations should be on the same version, or the higher version must be backward compatible.

hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination

 

With the above syntax, the entire source folder is copied (not moved) into the destination folder.
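As a small illustration of what "fully qualified" means here, the sketch below assembles the two URIs from their parts and prints the resulting command instead of running it. The helper hdfs_uri, the hostnames nn1.example.com and nn2.example.com, the port 8020 and the user alice are all placeholders chosen for this example, not values from a real cluster.

```shell
# Build a fully qualified HDFS URI from its parts.
hdfs_uri() {
    # $1 = namenode host, $2 = namenode port, $3 = absolute path
    echo "hdfs://$1:$2$3"
}

# Hypothetical clusters: both hostnames are placeholders.
SRC=$(hdfs_uri nn1.example.com 8020 /user/alice/source)
DEST=$(hdfs_uri nn2.example.com 8020 /user/alice/destination)

# Print the command rather than running it, so it can be reviewed first.
echo "hadoop distcp $SRC $DEST"
```

For a copy within one cluster, both calls would use the same host and port and differ only in the path.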

-update and -overwrite Flags

When you use the -update or -overwrite flag as below, only the contents of the source folder are copied, not the folder itself.

hadoop distcp -update hdfs://namenode:port/source hdfs://namenode:port/destination
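The difference in where files land can be sketched with a small shell function. It models the rule described in this post, assuming the destination directory already exists. The function name target_path and the file data.csv are invented for the example.

```shell
# Where does <source>/<file> end up? A sketch of the rule described above,
# assuming the destination directory already exists.
target_path() {
    mode=$1   # "default", "update" or "overwrite"
    src=$2    # source directory, e.g. /source
    dest=$3   # destination directory, e.g. /destination
    file=$4   # file name inside the source directory
    case "$mode" in
        # No flag: the source folder itself is placed inside the destination.
        default) echo "$dest/$(basename "$src")/$file" ;;
        # -update / -overwrite: only the folder's contents are placed there.
        update|overwrite) echo "$dest/$file" ;;
        *) echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}

target_path default /source /destination data.csv   # /destination/source/data.csv
target_path update  /source /destination data.csv   # /destination/data.csv
```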

 

When you use the -update flag:

If the destination has a file with the same name, size and contents, the destination file is not updated.
If the destination has a file with the same name but a different size or contents, the destination file is overwritten.

When you use the -overwrite flag:

If the destination has a file with the same name, it is overwritten, whether or not its size and contents match the source.
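The rules for the two flags can be summarised as a small decision function. This is a simplified model for a single file that already exists at the destination under the same name. It compares file sizes only, which is an assumption made to keep the sketch short, and the name distcp_action is hypothetical.

```shell
# Simplified model of what happens to ONE file that already exists at the
# destination under the same name. This sketch looks at file size only.
distcp_action() {
    mode=$1        # "update" or "overwrite"
    src_size=$2    # size in bytes of the source file
    dest_size=$3   # size in bytes of the destination file
    case "$mode" in
        overwrite)
            echo "overwrite" ;;
        update)
            if [ "$src_size" -eq "$dest_size" ]; then
                echo "skip"
            else
                echo "overwrite"
            fi ;;
        *)
            echo "unknown mode: $mode" >&2
            return 1 ;;
    esac
}

distcp_action update 1024 1024      # skip
distcp_action update 1024 2048      # overwrite
distcp_action overwrite 1024 1024   # overwrite
```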

Multiple Sources

You can also copy files from multiple sources into a single destination. Here is the syntax:

hadoop distcp hdfs://namenode:port/source1 hdfs://namenode:port/source2 hdfs://namenode:port/destination

 

Alternatively, you can save all the source locations in a file and pass that file with the -f flag:

hadoop distcp -f hdfs://namenode:port/srclist hdfs://namenode:port/destination
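For illustration, a source list is just a text file with one fully qualified source URI per line. The sketch below writes such a file locally and prints the commands that would upload and use it. The namenode and port values are the same placeholders used above, and the upload path /srclist is an assumption that the cluster's default filesystem is that namenode.

```shell
# Write one fully qualified source URI per line to a local file.
# "namenode" and "port" are placeholders, as in the syntax above.
cat > srclist <<'EOF'
hdfs://namenode:port/source1
hdfs://namenode:port/source2
EOF

# Show what was written.
cat srclist

# The list has to be readable by DistCp, so it is uploaded to HDFS first.
# Both commands are printed rather than run; substitute real values
# for the placeholders before using them.
echo "hadoop fs -put srclist /srclist"
echo "hadoop distcp -f hdfs://namenode:port/srclist hdfs://namenode:port/destination"
```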

 

Try it out

Try the commands in our cluster. Click here to get FREE access to the cluster.

# Create the directories if they do not exist
hadoop fs -mkdir -p distcp_src

hadoop fs -mkdir -p distcp_dest

# Copy the file if not present
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv distcp_src

# Copy from source to destination. Replace <username> with the username that was provided to you.
hadoop distcp hdfs://ip-172-31-45-216.ec2.internal:8020/user/<username>/distcp_src hdfs://ip-172-31-45-216.ec2.internal:8020/user/<username>/distcp_dest

# For example:
hadoop distcp hdfs://ip-172-31-45-216.ec2.internal:8020/user/hirwuser150430/distcp_src hdfs://ip-172-31-45-216.ec2.internal:8020/user/hirwuser150430/distcp_dest

# Check the destination directory after executing distcp
hadoop fs -ls distcp_dest

 

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems, implemented some innovative and complex (or simple, depending on how you look at it) solutions.
