Dealing With Data Corruption In HDFS
September 8, 2015
Hadoop is designed to store and analyze huge volumes of data, and with that much data stored in HDFS, the chances of data corruption increase. In this blog post we will discuss how HDFS detects and deals with data corruption.
Why could data get corrupted in HDFS?
Data can get corrupted while it is stored or while it is being processed. In an environment like Hadoop, data blocks are transferred over the network quite often, and there is always a chance of corruption during transmission. There is also a chance of data getting corrupted on disk due to hard disk failures.
How does HDFS detect corruption?
HDFS detects corruption using checksums. Think of a checksum as a small value that is calculated from a piece of data using the data itself. Whenever data comes into HDFS, a checksum is calculated on the data and stored alongside it. Blocks in HDFS are then validated during writes, during reads, and at periodic intervals: the checksum that was originally stored is compared with a checksum freshly calculated from the data at that moment. If the two checksums differ during any of these checks, the block is deemed corrupted.
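To make the idea concrete, here is a minimal Java sketch of checksum-based detection. It is only an illustration of the principle, not HDFS's actual code (HDFS computes a CRC checksum for each small chunk of a block rather than one checksum per block):

```java
import java.util.zip.CRC32;

public class ChecksumDemo {
    // Compute a CRC32 checksum over a byte array -- the same basic idea
    // HDFS uses when it stores a checksum next to each chunk of a block.
    static long checksum(byte[] data) {
        CRC32 crc = new CRC32();
        crc.update(data, 0, data.length);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] original = "block contents".getBytes();
        long storedChecksum = checksum(original);   // computed at write time

        // Simulate a single flipped bit on disk.
        byte[] onDisk = original.clone();
        onDisk[3] ^= 0x01;

        // Verification at read time: recompute and compare with the stored value.
        boolean corrupted = checksum(onDisk) != storedChecksum;
        System.out.println("Block corrupted? " + corrupted);   // prints true
    }
}
```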
When does HDFS verify checksums?
When data is written to HDFS, checksums are calculated for the data, and the datanodes that store the data verify what they receive against those checksums before writing it to disk. This way we can be sure that the data was not corrupted during transmission.
When a client reads data from HDFS, the client is responsible for verifying the data returned by the datanodes: it recomputes checksums on the data it receives and compares them with the checksums that were stored with the data. This way we can be sure that the data was not corrupted while it sat on disk in the datanodes.
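From a client's point of view this verification is automatic. Below is a rough sketch using the Hadoop FileSystem API (the path /data/sample.txt is hypothetical): checksum verification is on by default when reading from HDFS, a ChecksumException is raised on a mismatch, and calling setVerifyChecksum(false) would skip the check.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ChecksumException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class VerifiedRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Checksum verification is enabled by default for HDFS reads;
        // fs.setVerifyChecksum(false) would disable it.
        Path file = new Path("/data/sample.txt");   // hypothetical example path
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[4096];
            while (in.read(buffer) > 0) {
                // process the data ...
            }
        } catch (ChecksumException e) {
            // Thrown when the recomputed checksum does not match the stored one.
            System.err.println("Corrupt block detected while reading: " + e.getMessage());
        }
    }
}
```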
In addition to verifying data during reads and writes, each datanode also runs a background process called the DataBlockScanner, which periodically scans the blocks stored on that datanode to catch corruption caused by problems with the hard disks.
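How often the scanner re-reads its blocks is configurable. As a small sketch (assuming the standard dfs.datanode.scan.period.hours property, which is normally set in hdfs-site.xml rather than in code), this is how the setting looks from the Configuration API:

```java
import org.apache.hadoop.conf.Configuration;

public class ScanPeriod {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Interval at which a datanode re-scans its blocks; 504 hours
        // (three weeks) is the commonly cited default.
        int hours = conf.getInt("dfs.datanode.scan.period.hours", 504);
        System.out.println("Block scan period: " + hours + " hours");
    }
}
```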
How does HDFS fix corrupted data?
This is very simple. HDFS is built from the ground up to handle failures. By default, each block in HDFS is replicated on three different nodes across the cluster, so when a corrupted block is identified, HDFS simply arranges for a good copy of the block from one of the other replicas to be copied to replace the corrupted one.
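If you want to see which files the namenode currently knows to have corrupt blocks, the FileSystem API exposes a listing call. This is a minimal sketch assuming a Hadoop 2.x-style client; the repair itself still happens automatically from healthy replicas, this call only reports what the namenode has recorded:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListCorruptBlocks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Ask the namenode which files under "/" currently have corrupt blocks.
        RemoteIterator<Path> corrupt = fs.listCorruptFileBlocks(new Path("/"));
        while (corrupt.hasNext()) {
            System.out.println("File with corrupt block: " + corrupt.next());
        }
    }
}
```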