Dealing With Data Corruption In HDFS


Hadoop is designed to store and analyze huge volumes of data, and with that much data stored in HDFS the chances of data corruption increase. In this blog post we will discuss how HDFS detects and deals with corrupted data.

Why does data get corrupted in HDFS?

Data can get corrupted while it is being stored or processed. In an environment like Hadoop, data (blocks) gets transferred over the network quite often, and there is always a chance of corruption during transmission. There is also a good chance of data getting corrupted on disk due to hard disk failures.

How does HDFS detect corruption?

HDFS detects corruption using checksums. Think of a checksum as a small value computed from a piece of data using the data itself; if the data changes, the checksum changes. Whenever data comes into HDFS, a checksum is calculated and stored alongside the data. Blocks in HDFS are then validated during writes, during reads and at periodic intervals by recalculating the checksum on the data and comparing it with the checksum that was stored initially. If the two values differ during any of these checks, the block is deemed corrupted.
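
To make the idea concrete, here is a minimal, self-contained Java sketch of the recompute-and-compare principle. It uses plain CRC32 from the JDK purely as an illustration; HDFS itself typically computes CRC32C checksums over small chunks of each block (512 bytes per checksum by default), but the logic is the same.

```java
import java.util.zip.CRC32;

public class ChecksumIllustration {

    // Compute a CRC32 checksum over a chunk of bytes.
    // HDFS uses CRC32C per 512-byte chunk by default, but the idea is identical.
    static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] chunk = "some block data".getBytes();

        // Checksum recorded when the data was first written.
        long storedChecksum = checksum(chunk);

        // Later (on read, or during a periodic scan) the checksum is recomputed.
        // A single flipped bit on disk or on the wire changes the value.
        chunk[0] ^= 0x01; // simulate corruption
        long recomputed = checksum(chunk);

        if (recomputed != storedChecksum) {
            System.out.println("Checksum mismatch - block is considered corrupt");
        }
    }
}
```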

When does HDFS verify checksums?

When data is written to HDFS, checksums are calculated for the data, and the datanodes storing the data recalculate and verify those checksums before the blocks are written to disk. This way we can be sure that the data was not corrupted during transmission.
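
As a practical aside, the checksums HDFS maintains can also be surfaced at the file level. The sketch below is a hypothetical example (the paths are made up) that asks HDFS for the file-level checksum of two files and compares them, which is the same mechanism tools like distcp use to verify copies. File-level checksums are only comparable when both files were written with the same block size and bytes-per-checksum settings.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifyCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths - replace with your own files.
        FileChecksum original = fs.getFileChecksum(new Path("/data/events/part-00000"));
        FileChecksum copy     = fs.getFileChecksum(new Path("/backup/events/part-00000"));

        // Comparable only if both files share the same block size
        // and bytes-per-checksum settings.
        System.out.println("Checksums match: " + original.equals(copy));
    }
}
```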

When a client reads data from HDFS, the client recalculates checksums on the data returned by the datanodes and verifies them against the checksums that were stored with the data. This way we can be sure that the data was not corrupted while it was stored on disk in the datanodes.
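
For completeness, here is a small sketch of a client read through the FileSystem API, again with a made-up path. Verification is on by default; a mismatch surfaces as a ChecksumException, and the HDFS client will report the bad replica and try another one.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadWithChecksums {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/data/events/part-00000"); // hypothetical path

        // Checksum verification is enabled by default; shown here for clarity.
        fs.setVerifyChecksum(true);

        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            int read = in.read(buffer);
            System.out.println("Read " + read + " bytes with verification enabled");
        }
    }
}
```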

In addition to verifying data during reads and writes, each datanode also runs a background process called the DataBlockScanner, which periodically scans the blocks stored on that datanode for corruption caused by issues with the hard disks.
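
The scan interval is configurable. The sketch below sets it programmatically on a Configuration object just to show the knob; in a real cluster you would set the same property in hdfs-site.xml. The property name dfs.datanode.scan.period.hours and its defaults are taken from recent Hadoop releases, so check the documentation for your version.

```java
import org.apache.hadoop.conf.Configuration;

public class BlockScannerConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // How often each datanode scans all of its blocks (in hours).
        // In recent Hadoop releases the default is three weeks (504 hours);
        // 0 falls back to the default and a negative value disables the scanner.
        conf.setInt("dfs.datanode.scan.period.hours", 504);

        System.out.println(conf.get("dfs.datanode.scan.period.hours"));
    }
}
```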

How does HDFS fix corrupted data?

This is very simple. HDFS is built from the ground up to handle failures. By default, each block in HDFS is replicated on 3 different nodes across the cluster. So when a corrupt block is identified, HDFS simply arranges to copy a healthy replica of the block from one of the other nodes to replace the corrupted copy.
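
Re-replication happens automatically, but you can ask the namenode which files currently have corrupt blocks. Below is a small sketch using the FileSystem API with a made-up path; the same information is available from the hdfs fsck command.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListCorruptFiles {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Files under /data that currently have at least one corrupt block.
        // The namenode re-replicates them from healthy replicas in the background.
        RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/data"));
        while (corrupt.hasNext()) {
            System.out.println("Corrupt block in: " + corrupt.next());
        }
    }
}
```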

Big Data In Real World
We are a group of Big Data engineers who are passionate about Big Data and related Big Data technologies. We have designed, developed, deployed and maintained Big Data applications ranging from batch to real time streaming big data platforms. We have seen a wide range of real world big data problems and implemented some innovative and complex (or simple, depending on how you look at it) solutions.

