How to deal with corrupt files in HDFS?


In a Big Data platform, when you manage a lot of data, it is quite inevitable that at some point some of it will get corrupted. Corruption can result from bad storage devices or from systemic failures in network transmission. This post explains how to deal with corrupt files in HDFS.

Note that we are not talking about fixing corrupted files. Fixing requires you to know what is corrupted, how and where, and that information is hard to recover from a corrupted file. What we can do when we encounter corruption is remove the corrupted files from HDFS, because leaving them unattended can leave the namenode stuck in safe mode the next time it restarts.
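For reference, you can check and control the namenode's safe mode state with dfsadmin. The commands below are only a quick sketch of that check; force the namenode out of safe mode only after the corrupt or missing blocks have been dealt with.

# Check whether the namenode is currently in safe mode
hdfs dfsadmin -safemode get

# Leave safe mode manually once the corrupt/missing blocks have been handled
hdfs dfsadmin -safemode leave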

Identifying corrupted files

The Hadoop fsck (file system check) command is a great way to inspect the health of the filesystem.

hdfs fsck / gives you a report like the one below, which helps you check the health of the cluster and gives you a count of corrupt blocks, but it doesn't give you the list of files that are corrupted.

hdfs fsck / 
…..
…..
…..

Total size:    1943153298 B (Total open files size: 3422 B)
 Total dirs:    137
 Total files:   830
 Total symlinks:                0 (Files currently being written: 2)
 Total blocks (validated):      819 (avg. block size 2372592 B) (Total open file blocks (not validated): 2)
  ********************************
  UNDER MIN REPL'D BLOCKS:      1 (0.12210012 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         28160 B
  CORRUPT BLOCKS:       1
  ********************************

 Minimally replicated blocks:   818 (99.8779 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       818 (99.8779 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     1.997558
 Corrupt blocks:                1
 Missing replicas:              818 (33.292633 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Tue Nov 10 15:51:30 UTC 2020 in 366 milliseconds

hdfs fsck -list-corruptfileblocks lists the files with corrupted blocks.

[hdfs@wk1 osboxes]$ hdfs fsck -list-corruptfileblocks

Connecting to namenode via http://ms1.hirw.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1073742622  /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
The filesystem under path '/' has 1 CORRUPT files
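Before deleting anything, you may want to confirm which blocks and datanodes are affected. fsck can print block-level details for a specific path; the following is a sketch run against the file reported above.

# Print file, block and replica location details for the corrupted file
hdfs fsck /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log -files -blocks -locations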

Remove corrupted files

This step is straightforward once you know which files are corrupted.

Issue hdfs dfs -rm on each of the corrupted files.

hdfs dfs -rm /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
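If fsck reports many corrupted files, you can script the cleanup instead of removing them one by one. The line below is a rough sketch that assumes the -list-corruptfileblocks output keeps the block id in the first column and the file path in the second (as in the listing above); adjust the parsing for your Hadoop version. -skipTrash frees the space immediately instead of moving the files to trash.

# Remove every file that fsck reports as corrupt
hdfs fsck / -list-corruptfileblocks | grep "^blk_" | awk '{print $2}' | xargs -r -n 1 hdfs dfs -rm -skipTrash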

Avoid corruption

Corruptions are costly and hard to recover so take every possible step to avoid them.

  • Make sure the storage devices in the cluster are in good shape. Do routine health checks on the nodes in your cluster.
  • Recover broken nodes quickly and add them back to the cluster.
  • Take snapshots of, or back up, data that you really cannot afford to lose or that cannot be easily reproduced (see the sketch after this list).
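As a sketch of the snapshot option above, HDFS snapshots are enabled per directory and then captured with a single command. The directory and snapshot names below are only examples.

# Allow snapshots on the directory you want to protect (run as an HDFS administrator)
hdfs dfsadmin -allowSnapshot /data/critical

# Create a point-in-time snapshot; it appears under /data/critical/.snapshot/before-maintenance
hdfs dfs -createSnapshot /data/critical before-maintenance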