Silent Data Corruption, the Backup Killer
Data corruption is simply an unintentional change to a bit. An occasional bad bit or unrecoverable read error is unlikely to take down an application or render a restore useless. However, corrupted data is not uncommon.
When data corruption goes undetected, it becomes silent data corruption and is a high risk for applications. And when corrupted data creeps into backups and remains undetected, you have a real data integrity and restore problem on your hands.
Hardware and software both introduce errors into the data path. On the hardware side, head failures, noisy data transfer, electronic problems, aging and wear can introduce bit errors. And with a nod to 1950s science fiction movies, cosmic rays can cause DRAM soft errors (memory bit flips).
On the software side, coding bugs can damage data integrity in the OS, file system, firmware, and anywhere else where data exists in the computing stack.
So How Big a Problem Is It?
An older CERN study put the average at one silent error in every 10^16 bits; more recent studies came in at similar averages. NEC reported on silent read failures that go unreported on disk arrays even with data integrity checks. The bad data is then passed to the application, introducing a variety of errors from a bad record to a failed application.
NetApp studied over a million and a half production disks over nearly a year. They identified over 400,000 silent data corruptions, about 13% of the total data under study. Error checking technology identified 370,000 of these, a good catch rate but one that left 30,000 errors undetected. NetApp's testing software caught them when the verification process did not, but in a production environment those 30,000 errors would have stayed on disk and entered the backup system, remaining there until needed for a restore, at which point the restore would fail.
The problem is not related to or solved by larger disk capacities: error rates have not significantly changed. This means that the much larger amount of data stored on high capacity disks is at correspondingly greater risk of silent corruption. In modern disks the 1-in-10^16 error rate is multiplied many times over because these disks store much more data.
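To put rough numbers on that, here is a back-of-the-envelope sketch in Python. It takes the one-in-10^16 figure at face value and assumes errors are independent and evenly spread, which real drives will not honor exactly; the function name and the sample capacities are purely illustrative.

```python
# Back-of-the-envelope estimate of silent errors per volume of data read.
# Assumption: a flat rate of one silent error per 10**16 bits, with errors
# independent and uniformly distributed. Real devices will deviate from this.
BITS_PER_BYTE = 8
BYTES_PER_TERABYTE = 10**12
ERROR_RATE_PER_BIT = 1 / 10**16


def expected_silent_errors(terabytes_read):
    """Return the expected number of silent bit errors for the volume read."""
    bits_read = terabytes_read * BYTES_PER_TERABYTE * BITS_PER_BYTE
    return bits_read * ERROR_RATE_PER_BIT


if __name__ == "__main__":
    # Illustrative volumes only: one drive's worth up to a large archive.
    for capacity_tb in (1, 100, 1250, 10_000):
        errors = expected_silent_errors(capacity_tb)
        print(f"{capacity_tb:>6} TB read: about {errors:.4f} expected silent errors")
```

At this rate, 1,250 TB works out to 10^16 bits, or roughly one expected silent error. The rate per bit stays the same; the exposure grows with every terabyte stored and re-read.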
Let's take a database backup. DR is in place for this important data with an RPO of 15 seconds. The database crashes, you go to restore your fresh backup, and you find out that the corruption exists within the backup and has been there for more than three days. That is three days' worth of near-continuous, and now corrupted, backups.
And don't think that backing up to the cloud is going to magically solve the problem. Your backup is going to your provider's SSDs and hard drives, which are subject to the exact same error rates as any other storage medium.
Cloud services provider eFolder speaks actively about guarding against silent data corruption in the cloud. They suggest that when speaking to online cloud storage vendors, you ask about this very issue, including what technology the provider uses to protect data on its storage media and during backup at the cloud site. Amazon S3, for instance, runs checksums to preserve data during network transfers and at rest.
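You can also run the same kind of check yourself. The sketch below is a minimal, provider-agnostic illustration: hash the file before it leaves your site, hash the copy you read back, and compare. The function names are hypothetical, and the code does not call any cloud provider's API; it only assumes you can retrieve the stored copy as a local file.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large backups need little memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(local_path, retrieved_path):
    """Return True when the retrieved copy is bit-for-bit identical.

    Hypothetical helper: 'retrieved_path' stands in for whatever copy you
    download or read back from the backup target.
    """
    return sha256_of_file(local_path) == sha256_of_file(retrieved_path)
```

Recording the digest at backup time, rather than recomputing it from the source later, matters here: if the source itself is later corrupted, a freshly computed digest would no longer tell you what a good copy looked like.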
Data Protection and Integrity Checks
The first way to protect backup integrity is to keep errors from entering backup storage in the first place. Beyond basic error correction and redundancy checks, some vendors have gone further to protect the IO stream. For example, EMC Isilon OneFS is designed to verify big data both within the file system and when it is sent over the network interface. Hyperconverged vendor Nutanix also runs silent data integrity checks to fix corrupted bits before they reach the hypervisor.
Other file systems with native end-to-end checksumming and integrity checking are open-source ZFS, developed by Sun, and Microsoft’s Resilient File System (ReFS). In the IO path, error correcting codes (ECC) and cyclic redundancy checks (CRCs) will catch the majority of errors. RAID levels that store parity will also help to catch and repair errors. RAID commonly protects storage arrays, and data protection vendor Unitrends' physical appliances use RAID 6 with its two parity blocks.
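To see how a CRC catches a flipped bit, consider this small Python sketch. It uses the standard library's zlib.crc32 and stores the CRC alongside each block, which is a simplification for illustration and not how ZFS, ReFS, or any particular array lays out its checksums. The helper names are hypothetical.

```python
import zlib


def write_block(payload):
    """Return the payload together with the CRC-32 computed at write time."""
    return payload, zlib.crc32(payload)


def read_block(payload, stored_crc):
    """Return the payload, or raise if it no longer matches its stored CRC."""
    if zlib.crc32(payload) != stored_crc:
        raise ValueError("CRC mismatch: block is corrupt")
    return payload


def flip_bit(data, bit_index):
    """Simulate silent corruption by flipping a single bit in the data."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    block, crc = write_block(b"customer record 000123")
    read_block(block, crc)  # the clean block passes verification
    try:
        read_block(flip_bit(block, 42), crc)
    except ValueError as error:
        print(f"detected: {error}")
```

A CRC only detects the damage. Repairing it takes a second good copy or parity, which is the role RAID and replicated storage play.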
Fixing errors in the IO stream protects applications, storage media and backup. If too many corrupted bytes slip through to backup, your last line of defense against data loss is also corrupted, and if that line fails, you will never be able to restore a good copy.
Save the Backup
Another step is to choose a backup product that checks itself against corruption introduced within the backup environment.
Most backup products have recovery testing built in. However, recovery testing and assurance do not automatically mean data integrity assurance. Among the vendors who do offer data error checks as well as recovery assurance are Intronis, Asigra, Veeam and Unitrends.
Intronis Cloud Backup and Recovery uses a local Safe Catalog to verify file integrity before launching backups or restores. Using a verified copy, Intronis scans backup copies residing at each of its remote data centers to verify data integrity across Intronis storage. Intronis automatically replaces a corrupted backup with a verified copy.
Asigra built data integrity checks into Enterprise Backup. The process runs automatically in the background and monitors backups for completion and integrity. In case of corruption, Asigra locates a good original and restores the corrupt files.
On the VM backup front, when Unitrends transfers data off-site from an on-premises appliance, it creates a checksum of the original file. When the subsequent deltas are written to the vault, Unitrends Enterprise Backup (UEB) runs a checksum to verify that the new data is a perfect match to the old.
Veeam SureBackup verifies backup data integrity and recoverability. Its full scan operation checks for common errors such as bit rot, and replaces the corrupt data with verified data from the source backup.
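The products above share a common pattern: compare each stored copy against a trusted checksum, then overwrite any bad copy with a verified one. The Python sketch below illustrates that pattern only. It is not any vendor's implementation, and the in-memory catalog and sites dictionaries, along with the function names, are hypothetical stand-ins for a real catalog and real storage locations.

```python
import hashlib


def digest(data):
    """Return the SHA-256 hex digest of a backup object's bytes."""
    return hashlib.sha256(data).hexdigest()


def scan_and_repair(catalog, sites):
    """Verify every copy against the catalog and replace corrupted copies.

    catalog: {filename: expected SHA-256 digest recorded at backup time}
    sites:   {site name: {filename: stored bytes}}  (hypothetical storage model)
    Returns a list of (site, filename) pairs that were repaired.
    """
    repaired = []
    for name, expected in catalog.items():
        # Find any copy, at any site, that still matches the catalog.
        good_copy = next(
            (files[name] for files in sites.values()
             if name in files and digest(files[name]) == expected),
            None,
        )
        for site, files in sites.items():
            if name in files and digest(files[name]) == expected:
                continue  # this copy is intact
            if good_copy is None:
                raise RuntimeError(f"no verified copy of {name} remains")
            files[name] = good_copy  # overwrite the missing or corrupt copy
            repaired.append((site, name))
    return repaired
```

The sketch also shows the limit of the approach: if every copy fails verification, there is nothing left to repair from, which is why catching corruption before it spreads to all copies matters.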