Finding the size of a directory in HDFS is a very common need in day-to-day Big Data work, and it has a simple solution.
Solution
Use the hdfs dfs -du command to get the size of a directory in HDFS.
hdfs dfs -du -s -h /path/to/dir
-du stands for disk usage
-s stands for summary; it aggregates the sizes of the files into a single total
-h stands for human readable (e.g. 64.0m instead of 67108864)
-v displays column names as a header in the output
-x excludes snapshots from the result. Snapshots are read-only, point-in-time copies of a directory structure in HDFS, usually used by Hadoop admins to preserve a copy of the files and folders at a point in time.
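When the size is needed inside a script, the raw byte count is usually easier to work with than the -h output. The following is a minimal sketch, assuming that the first column of the -du -s output is the size in bytes; the function name dir_size_bytes is a hypothetical helper, not part of Hadoop.

```shell
# Hypothetical helper (not part of Hadoop): print the size in bytes
# of an HDFS directory given as the first argument.
dir_size_bytes() {
  # -s aggregates everything under the directory into one summary line.
  # -h is deliberately omitted so the first column is a plain byte count,
  # which awk then extracts.
  hdfs dfs -du -s "$1" | awk '{print $1}'
}
```

For example, size=$(dir_size_bytes /path/to/dir) stores the byte count in a variable that can then be compared against a threshold.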