May 19, 2021

In a Big Data platform, when you manage large volumes of data, it is almost inevitable that some of it will become corrupted at some point. This post covers how to deal with corrupt files in HDFS. Corruption can result from bad storage devices or from systemic failures in network transmission.
Note that we are not talking about fixing the corrupted files. Fixing requires knowing what is corrupted, how, and where, and that information is hard to recover from a corrupted file. What we can do when we encounter corruption is remove the corrupted files from HDFS, because leaving them unattended can cause the NameNode to get stuck in safe mode when it is restarted.
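If the NameNode does end up stuck in safe mode, a minimal sketch of how to check and clear it (assuming a standard HDFS install; only force safe mode off after the corrupt blocks have been dealt with):

```shell
# Check whether the NameNode is currently in safe mode
state=$(hdfs dfsadmin -safemode get)   # prints e.g. "Safe mode is ON"

case "$state" in
  *ON*) echo "NameNode is in safe mode" ;;
  *)    echo "NameNode is operating normally" ;;
esac

# Once the corrupted files are cleaned up, leave safe mode manually:
#   hdfs dfsadmin -safemode leave
```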
Identifying corrupted files
The Hadoop fsck (file system check) command is a great tool to inspect the health of the filesystem.
hdfs fsck / produces a report like the one below, which helps you check the health of the cluster and gives you a count of corrupt blocks, but it does not list the files that are corrupted.
hdfs fsck /
.....
.....
.....
 Total size:    1943153298 B (Total open files size: 3422 B)
 Total dirs:    137
 Total files:   830
 Total symlinks:                0 (Files currently being written: 2)
 Total blocks (validated):      819 (avg. block size 2372592 B) (Total open file blocks (not validated): 2)
  ********************************
  UNDER MIN REPL'D BLOCKS:      1 (0.12210012 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         28160 B
  CORRUPT BLOCKS:       1
  ********************************
 Minimally replicated blocks:   818 (99.8779 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       818 (99.8779 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     1.997558
 Corrupt blocks:                1
 Missing replicas:              818 (33.292633 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Tue Nov 10 15:51:30 UTC 2020 in 366 milliseconds
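Since the full report is long, one way to surface just the corruption-related lines is to filter the output; a minimal sketch using grep:

```shell
# Run fsck and keep only the lines that mention corruption or missing blocks
hdfs fsck / | grep -iE 'corrupt|missing'
```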
hdfs fsck -list-corruptfileblocks lists the files with corrupted blocks.
[hdfs@wk1 osboxes]$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://ms1.hirw.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1073742622  /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
The filesystem under path '/' has 1 CORRUPT files
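To turn that output into a plain list of file paths, you can filter for the block lines. This is a sketch that assumes each corrupt entry is printed as a blk_... id followed by the file path, as in the output above:

```shell
# Extract just the file paths from the -list-corruptfileblocks output.
# Assumes each corrupt-file line starts with the block id (blk_...).
hdfs fsck -list-corruptfileblocks 2>/dev/null \
  | grep '^blk_' \
  | awk '{print $2}' \
  | sort -u > corrupt_files.txt
```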
Remove corrupted files
This step is straightforward once you have the list of corrupted files.
Issue an hdfs dfs -rm on each of them.
hdfs dfs -rm /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
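If fsck reports many corrupted files, removing them one by one is tedious. A sketch that feeds the fsck listing straight into hdfs dfs -rm (same blk_-prefix assumption as the output above; review the list before running anything destructive):

```shell
# Delete every corrupted file reported by fsck in one pass.
# -skipTrash frees the blocks immediately; drop it if you want the
# files to land in the HDFS trash first.
hdfs fsck -list-corruptfileblocks 2>/dev/null \
  | grep '^blk_' \
  | awk '{print $2}' \
  | sort -u \
  | while read -r path; do
      hdfs dfs -rm -skipTrash "$path"
    done
```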
Avoid corruption
Corruption is costly and hard to recover from, so take every possible step to avoid it.
- Make sure storage devices in the cluster are in good shape. Do routine health checks on the nodes in your cluster.
- Recover broken nodes quickly and add them back to the cluster.
- Take snapshots of, or back up, data that you really cannot afford to lose or that cannot be easily reproduced.
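As a sketch of the snapshot option: HDFS snapshots are enabled per directory and then taken on demand. The path /data/critical and the snapshot name here are example values, not anything from this cluster:

```shell
# Allow snapshots on a directory you cannot afford to lose
# (allowSnapshot requires superuser privileges)
hdfs dfsadmin -allowSnapshot /data/critical

# Take a named snapshot; it becomes visible under
# /data/critical/.snapshot/before-maintenance
hdfs dfs -createSnapshot /data/critical before-maintenance
```

A snapshot is read-only and cheap to create, so files deleted or corrupted later can still be copied back out of the .snapshot directory.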