June 25, 2015

The “Working With HDFS” chapter in our Hadoop Starter Kit course covers the details of working with HDFS. In that chapter we looked at how to copy, move, and delete files in HDFS. In this post let’s talk about a command that can copy a large volume of files or datasets in a distributed fashion within your cluster, or even between Hadoop clusters.
What is DistCp?
DistCp (distributed copy) is a tool used for large inter/intra-cluster copying. DistCp is very efficient because it uses MapReduce to copy files or datasets, which means the copy operation is distributed across multiple nodes in your cluster, making it far more effective than a hadoop fs -cp operation.
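To see the difference in practice, a plain hadoop fs -cp runs the entire copy from a single client process, while DistCp submits a MapReduce job so the copy work is spread across the cluster. The paths below are just placeholders.

hadoop fs -cp /source /destination

hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination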
How does it work?
DistCp expands a list of files and directories and distributes the work between multiple Map tasks; each Map task copies a partition of the files specified in the source list.
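For instance, DistCp lets you cap the number of Map tasks (and hence the number of simultaneous copies) with the -m option. A quick sketch with placeholder paths –

hadoop distcp -m 10 hdfs://namenode:port/source hdfs://namenode:port/destination

Here at most 10 Map tasks are launched, and each one copies its share of the files from the source list.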
Syntax
Here is the syntax. You have to specify fully qualified paths. The source and destination namenode and port information will be the same if you are copying files within the cluster, and different if you are copying between two different clusters. When you copy files between two clusters, both HDFS versions should be the same, or the higher version must be backward compatible.
hadoop distcp hdfs://namenode:port/source hdfs://namenode:port/destination
With the above syntax the entire source folder will be copied (not moved) into the destination folder.
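If the two clusters run HDFS versions that are not wire-compatible, one common workaround is to read the source over the read-only HFTP protocol (or WebHDFS on newer releases) and run DistCp on the destination cluster. A sketch, assuming the default namenode HTTP port 50070 –

hadoop distcp hftp://source-namenode:50070/source hdfs://destination-namenode:8020/destination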
-update and -overwrite Flag
When you use the -update or -overwrite flag as below, only the contents of the source folder are copied, not the folder itself.
hadoop distcp -update hdfs://namenode:port/source hdfs://namenode:port/destination
When you use the -update flag –
When the destination has a file with the same name, size, and contents, the file in the destination is not updated.
When the destination has a file with the same name but a different size or contents, the file in the destination is overwritten.
When you use the -overwrite flag –
When the destination has a file with the same name, regardless of its size or contents, the file in the destination is overwritten.
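For example, to force every file in the destination to be replaced regardless of whether it differs from the source, you would run something like this (paths are placeholders) –

hadoop distcp -overwrite hdfs://namenode:port/source hdfs://namenode:port/destination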
Multiple Sources
You can also copy files from multiple sources into a destination. Here is the syntax –
hadoop distcp hdfs://namenode:port/source1 hdfs://namenode:port/source2 hdfs://namenode:port/destination
Alternatively, you can save all the source locations in a file and supply that file using the -f flag –
hadoop distcp -f hdfs://namenode:port/srclist hdfs://namenode:port/destination
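A minimal sketch of building that source list – write one fully qualified source path per line into a local file and upload it to HDFS so DistCp can read it. The file name srclist and the paths are placeholders.

echo "hdfs://namenode:port/source1" > srclist
echo "hdfs://namenode:port/source2" >> srclist
hadoop fs -copyFromLocal srclist hdfs://namenode:port/srclist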
Try it out
Try the commands in our cluster. Click here to get FREE access to the cluster.
# Create directories if they do not exist
hadoop fs -mkdir distcp_src
hadoop fs -mkdir distcp_dest

# Copy the file if not present
hadoop fs -copyFromLocal /hirw-starterkit/hdfs/commands/dwp-payments-april10.csv distcp_src

# Copy from source to destination. Replace <username> with the username that was provided to you.
hadoop distcp hdfs://ip-172-31-45-216.ec2.internal:8020/user/<username>/distcp_src hdfs://ip-172-31-45-216.ec2.internal:8020/user/<username>/distcp_dest

# For example:
hadoop distcp hdfs://ip-172-31-45-216.ec2.internal:8020/user/hirwuser150430/distcp_src hdfs://ip-172-31-45-216.ec2.internal:8020/user/hirwuser150430/distcp_dest

# Check the destination directory after executing distcp
hadoop fs -ls distcp_dest