When you run a MapReduce or a Spark job, the number of output files equals the number of reducers involved in the MapReduce job or the number of tasks in the last stage of the Spark job.
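For example, a MapReduce job that ran with three reducers leaves three part files in its output directory. The paths below are illustrative and the -ls output is trimmed down to just the file names:

hadoop fs -ls /user/hirw/output
/user/hirw/output/_SUCCESS
/user/hirw/output/part-r-00000
/user/hirw/output/part-r-00001
/user/hirw/output/part-r-00002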
It is quite easy to combine multiple files into one. You could write a small utility script with Linux commands to do this, but why not use the power of the distributed cluster instead?
Solution
Use hadoop fs -getmerge to combine multiple output files into one.
hadoop fs -getmerge [-nl] <src> <localdst>
Takes a source directory and a destination file as input and concatenates files in src into the destination local file.
Optionally -nl can be set to enable adding a newline character (LF) at the end of each file.
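To see why the newline matters, here is a local analogy using cat rather than getmerge; the file names and contents are made up. If a file does not end with a newline, plain concatenation joins its last record with the first record of the next file, which is what -nl prevents:

printf 'a,1' > part-00000      # no trailing newline
printf 'b,2\n' > part-00001
cat part-00000 part-00001
a,1b,2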
Input – a directory, output – a local file
hadoop fs -getmerge -nl /src /home/hirw/big-output-file
Input – a set of files, output – a local file
hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /home/hirw/big-output-file
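Once the merge completes you can inspect the result on the local filesystem; the path below simply follows the earlier examples:

ls -lh /home/hirw/big-output-file
head -n 5 /home/hirw/big-output-file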