May 19, 2021

In a Big Data platform, when you manage large volumes of data, it is almost inevitable that some of it will become corrupted at some point. This post covers how to deal with corrupt files in HDFS. Corruption can result from bad storage devices or from systemic failures in network transmission.
Note that we are not talking about fixing the corrupted files. Fixing requires knowing what is corrupted, how, and where, and that information is hard to recover from a corrupted file. What we can do when we encounter corruption is remove the corrupted files from HDFS, because leaving them unattended can cause the NameNode to get stuck in safe mode when it is restarted.
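If the NameNode does end up stuck in safe mode, a minimal sketch of how to check and clear it (assuming a standard HDFS install; only force safe mode off after the corrupt blocks have been dealt with):

```shell
# Check whether the NameNode is currently in safe mode
state=$(hdfs dfsadmin -safemode get)   # prints e.g. "Safe mode is ON"

case "$state" in
  *ON*) echo "NameNode is in safe mode" ;;
  *)    echo "NameNode is operating normally" ;;
esac

# Once the corrupted files are cleaned up, leave safe mode manually:
#   hdfs dfsadmin -safemode leave
```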
Identifying corrupted files
The Hadoop fsck (file system check) command is a great tool to inspect the health of the filesystem.
hdfs fsck / produces a report like the one below, which helps you check the health of the cluster and gives you a count of corrupt blocks, but it does not list the files that are corrupted.
hdfs fsck /
.....
.....
.....
 Total size:    1943153298 B (Total open files size: 3422 B)
 Total dirs:    137
 Total files:   830
 Total symlinks:                0 (Files currently being written: 2)
 Total blocks (validated):      819 (avg. block size 2372592 B) (Total open file blocks (not validated): 2)
  ********************************
  UNDER MIN REPL'D BLOCKS:      1 (0.12210012 %)
  dfs.namenode.replication.min: 1
  CORRUPT FILES:        1
  MISSING BLOCKS:       1
  MISSING SIZE:         28160 B
  CORRUPT BLOCKS:       1
  ********************************
 Minimally replicated blocks:   818 (99.8779 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       818 (99.8779 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     1.997558
 Corrupt blocks:                1
 Missing replicas:              818 (33.292633 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Tue Nov 10 15:51:30 UTC 2020 in 366 milliseconds
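Since the full report is long, one way to surface just the corruption-related lines is to filter the output; a minimal sketch using grep:

```shell
# Run fsck and keep only the lines that mention corruption or missing blocks
hdfs fsck / | grep -iE 'corrupt|missing'
```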
hdfs fsck -list-corruptfileblocks lists the files with corrupted blocks.
[hdfs@wk1 osboxes]$ hdfs fsck -list-corruptfileblocks
Connecting to namenode via http://ms1.hirw.com:50070/fsck?ugi=hdfs&listcorruptfileblocks=1&path=%2F
The list of corrupt files under path '/' are:
blk_1073742622  /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
The filesystem under path '/' has 1 CORRUPT files
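To turn that output into a plain list of file paths, you can filter for the block lines. This is a sketch that assumes each corrupt entry is printed as a blk_... id followed by the file path, as in the output above:

```shell
# Extract just the file paths from the -list-corruptfileblocks output.
# Assumes each corrupt-file line starts with the block id (blk_...).
hdfs fsck -list-corruptfileblocks 2>/dev/null \
  | grep '^blk_' \
  | awk '{print $2}' \
  | sort -u > corrupt_files.txt
```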
Remove corrupted files
This step is straightforward once you have the list of corrupted files.
Issue an hdfs dfs -rm on each of them.
hdfs dfs -rm /ranger/audit/hdfs/20200813/hdfs_ranger_audit_ms1.hirw.com.log
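If fsck reports many corrupted files, removing them one by one is tedious. A sketch that feeds the fsck listing straight into hdfs dfs -rm (same blk_-prefix assumption as the output above; review the list before running anything destructive):

```shell
# Delete every corrupted file reported by fsck in one pass.
# -skipTrash frees the blocks immediately; drop it if you want the
# files to land in the HDFS trash first.
hdfs fsck -list-corruptfileblocks 2>/dev/null \
  | grep '^blk_' \
  | awk '{print $2}' \
  | sort -u \
  | while read -r path; do
      hdfs dfs -rm -skipTrash "$path"
    done
```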
Avoid corruption
Corruption is costly and hard to recover from, so take every possible step to avoid it.
- Make sure storage devices in the cluster are in good shape. Do routine health checks on the nodes in your cluster.
- Recover broken nodes quickly and add them back to the cluster.
- Take snapshots of, or back up, data that you really cannot afford to lose or that cannot be easily reproduced.
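As a sketch of the snapshot option: HDFS snapshots are enabled per directory and then taken on demand. The path /data/critical and the snapshot name here are example values, not anything from this cluster:

```shell
# Allow snapshots on a directory you cannot afford to lose
# (allowSnapshot requires superuser privileges)
hdfs dfsadmin -allowSnapshot /data/critical

# Take a named snapshot; it becomes visible under
# /data/critical/.snapshot/before-maintenance
hdfs dfs -createSnapshot /data/critical before-maintenance
```

A snapshot is read-only and cheap to create, so files deleted or corrupted later can still be copied back out of the .snapshot directory.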