I was recently at a preservation archive workshop hosted by a U.S. government agency (these are installations that must preserve information in a digital format forever) when it occurred to me that files need to be much better protected to make these kinds of archives a reality.
The workshop was held to discuss some of the challenges facing preservation archives, whose contents cannot change once documents are converted from their original analog format to digital.
Most of the industry participants agreed that over time, digital data will inevitably experience bits that flip (0 becomes 1, or vice versa). Eventually a bit or two or even more will flip or be read incorrectly, and the file might become unreadable or corrupted to the point where it is not usable. It is one thing to lose a single bit in the body of a file, but if the lost or unreadable bit falls in the wrong part of the file definition, often called an application header or file header because it sits at the beginning of the file, the whole file might be lost.
A participant from the film industry mentioned that even film that is 100 years old and not perfect can usually be displayed, and most of the film is viewable and clear enough to the average person. The participant asked why digital file formats (jpg, mpeg-3, mpeg-4, jpeg2000, and so on) can't allow the same degradation and remain viewable. A great question, and one that no one had an answer for.
Headers and File System Superblocks
A vendor friend of mine who came out of the simulation industry told me that back in the 1970s, a project he was working on used two file headers for each of the simulation file outputs they were generating. That got me thinking about how file systems write multiple copies of the superblock, which essentially performs the same function as a header on a digital file. Almost every file has a header, from all Microsoft application files to digital audio and video to the data used to create your weather forecast, your car, or the airplane you fly on. The superblock for a file system allows the file system to be read, understood and processed, and if you lose a disk where the superblock is located, the file system tries to read other disks and determine if there is a valid copy of the superblock.
The need for this is no different than the header on an audio or video digital format or any other file. While it's only part of the problem, what if a file could be written with multiple headers and the application knew where to look for the headers? You'd want headers at different ends of the file because if a sector is corrupted, having two headers in the same sector likely won't help.
So let's say you have a header at the start of the file and another at the end of the file; how do you figure out which one is good? The obvious answer is to store a checksum of the header data with each header and compare it against a freshly computed one, which means that you will have to read the header and validate the checksum. Another way would be to add ECC (error correction code) to the header so minor corruptions can be corrected. A third method, used in many telecommunications systems and on the Space Shuttle, is called voting: read three or more headers and compare them to see which two or more have the same results. I think the addition of ECC to the header is the most attractive option for a number of reasons:
- You are reading less data
- You are seeking less in the file, as you are only reading a single header and not multiple ones
- ECC allows the failure to be both detected and corrected, and is therefore more robust than methods that only validate checksums
- Today we clearly have the processing power to both validate the header and correct it when the ECC check reports an error
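To make the detect-and-correct point concrete, here is a minimal sketch using a Hamming(7,4) code, one of the simplest ECC schemes. It is only an illustration, not what any particular file format does: the function names (`protect`, `recover`) and the `HDR1` sample header are hypothetical, and a real format would more likely use a stronger code such as Reed-Solomon. Each 4-bit half of a header byte is stored as a 7-bit codeword that can survive one flipped bit.

```python
def hamming74_encode(nibble: int) -> int:
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = [(nibble >> i) & 1 for i in range(4)]
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    bits = [p1, p2, d1, p4, d2, d3, d4]  # index i holds position i + 1
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_decode(code: int) -> int:
    """Correct up to one flipped bit and return the 4 data bits."""
    bits = [(code >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)  # position of the bad bit, 0 if none
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the damaged bit back
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


def protect(header: bytes) -> bytes:
    """Store every header byte as two codewords (low nibble, high nibble)."""
    out = bytearray()
    for byte in header:
        out.append(hamming74_encode(byte & 0x0F))
        out.append(hamming74_encode(byte >> 4))
    return bytes(out)


def recover(coded: bytes) -> bytes:
    """Rebuild the header, fixing one flipped bit per codeword."""
    out = bytearray()
    for i in range(0, len(coded), 2):
        low = hamming74_decode(coded[i] & 0x7F)
        high = hamming74_decode(coded[i + 1] & 0x7F)
        out.append(low | (high << 4))
    return bytes(out)


# Usage: flip one bit in the stored header and read it back intact.
stored = bytearray(protect(b"HDR1"))
stored[3] ^= 0x10
assert recover(bytes(stored)) == b"HDR1"
```

The cost in this sketch is that the protected header is twice the size of the original, which is the kind of space-for-safety trade discussed here. Two flipped bits in the same codeword would still be miscorrected, so the amount of ECC has to match how much damage you expect.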
The drawback is that if the sector where the header is located is badly corrupted, you likely won't be able to reconstruct the file, so two headers plus ECC should be the solution for the most critical files. Multiple headers protect against sector failure, while ECC protects against a few bits or multiple bits flipping, depending on how much ECC you use.
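The two-header idea can be sketched as follows. This is an illustration under stated assumptions, not an existing file format: the `HDR1` marker, the fixed 64-byte header size and all function names are made up for the example, and a plain CRC32 stands in for the checksum (it detects damage but cannot repair it). The reader tries the header at the start of the file and falls back to the copy at the end.

```python
import struct
import zlib

MAGIC = b"HDR1"   # hypothetical format marker
HEADER_SIZE = 64  # assumed fixed size, so the reader knows where to look


def pack_header(fields: bytes) -> bytes:
    """Build a fixed-size header: marker, length, fields, padding, CRC32."""
    body = MAGIC + struct.pack("<I", len(fields)) + fields
    if len(body) > HEADER_SIZE - 4:
        raise ValueError("header fields too large")
    body = body.ljust(HEADER_SIZE - 4, b"\x00")
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_header(blob: bytes):
    """Return the header fields, or None if the copy fails validation."""
    if len(blob) != HEADER_SIZE:
        return None
    body = blob[:-4]
    (stored_crc,) = struct.unpack("<I", blob[-4:])
    if body[:4] != MAGIC or zlib.crc32(body) != stored_crc:
        return None
    (length,) = struct.unpack("<I", body[4:8])
    return body[8:8 + length]


def build_file(fields: bytes, payload: bytes) -> bytes:
    """Write the same header at both ends of the file."""
    header = pack_header(fields)
    return header + payload + header


def read_fields(data: bytes):
    """Try the leading header first, then the trailing copy."""
    for blob in (data[:HEADER_SIZE], data[-HEADER_SIZE:]):
        fields = unpack_header(blob)
        if fields is not None:
            return fields
    return None


# Usage: damage the first header; the copy at the end still validates.
image = bytearray(build_file(b"width=640;height=480", b"pixels" * 10))
image[10] ^= 0x01
assert read_fields(bytes(image)) == b"width=640;height=480"
```

Combining this with ECC would mean protecting each header copy with an error-correcting code before writing it, so that small bit flips are repaired in place and the second copy is only needed when a whole sector is lost.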
Another feature of this method is that the headers could be large and maybe even padded to a full disk hardware sector, which is currently 512 bytes but might change to 4096 at some point in the future.
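Padding a header to a whole sector is a small calculation. The helper below is a hypothetical sketch (the name `pad_to_sector` and the zero-byte filler are assumptions): it rounds the header up to the next multiple of the sector size so that no header shares a sector with other data.

```python
def pad_to_sector(header: bytes, sector_size: int = 512) -> bytes:
    """Zero-pad a header to the next multiple of sector_size bytes."""
    sectors = max(1, -(-len(header) // sector_size))  # ceiling division
    return header.ljust(sectors * sector_size, b"\x00")


# Usage: the same 100-byte header on 512-byte and 4096-byte sectors.
assert len(pad_to_sector(b"x" * 100)) == 512
assert len(pad_to_sector(b"x" * 100, sector_size=4096)) == 4096
```

Passing the sector size as a parameter keeps the layout usable if the hardware sector size changes.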
Precedents from Broadcast Industry, Dedupe
ECC methods have been around for decades; it is time to start using this technology so that files aren't lost over bit errors. With all the compression technologies in use, often the loss of a single bit can mean the loss of a whole file. How many times have you opened a digital picture at home only to find that it is unreadable?
Back in the 1990s, there were a few RAID companies that ignored errors on read for the broadcast industry. These companies did this because if you are playing a commercial for the Super Bowl, it is better to lose a few bits in replay than to not be able to play the commercial and lose millions in revenue. Very often the few bits that were lost were not even noticed. The broadcast industry has long known that it is better to lose a few bits than to lose the opportunity to send the broadcast down the wire. The problem we have today is that with the need for compression algorithms for pictures, video and audio, the loss of a few bits has a much more dramatic impact than it did when streaming uncompressed formats back in the 1990s.
Some applications create a per-file checksum, but that does not correct a problem in the file; it just tells you the file has been changed, and when you cannot display something or it looks weird, that's kind of obvious anyway. I am a big fan of compression, and maybe what we need to do is take some of the lessons learned from the data deduplication industry. Many of the data deduplication products have ECC that is able to correct every block. The amount of ECC varies from vendor to vendor, and the ECC needed for pictures of your mother-in-law might be different from the ECC required for a preservation archive of the U.S. government.
I surely want more ECC on my IRS records than on one of my underwater photos, and I am more than willing to give up disk space, CPU, memory bandwidth and time to process the ECC. It would be nice to have a way of setting and resetting the amount of ECC as needs change, but file formats still need a way of displaying a file even if there is a failure in one of the ECC areas. I should not have to lose a whole file if the ECC block for one of a fish's eyes is bad, using my underwater photo example. Just show me the fish and I will figure out what to do in Photoshop.
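The "just show me the fish" behavior can be sketched with independently protected blocks. This is a simplified illustration with assumptions stated up front: it uses a CRC32 per block rather than real ECC, so it can only detect a bad block, not repair it, and the 16-byte block size, the `?` filler and the function names are all hypothetical. The point is only that one bad block is replaced with a placeholder while everything else is still returned.

```python
import zlib

BLOCK_SIZE = 16  # hypothetical block size, kept tiny for the example


def encode_blocks(data: bytes):
    """Split data into blocks and pair each one with its own CRC32."""
    blocks = []
    for start in range(0, len(data), BLOCK_SIZE):
        chunk = data[start:start + BLOCK_SIZE]
        blocks.append((zlib.crc32(chunk), chunk))
    return blocks


def decode_blocks(blocks, filler: bytes = b"?"):
    """Return (data, damaged): bad blocks become filler, the rest survive."""
    out = bytearray()
    damaged = []
    for index, (stored_crc, chunk) in enumerate(blocks):
        if zlib.crc32(chunk) == stored_crc:
            out += chunk
        else:
            out += filler * len(chunk)  # placeholder for the lost region
            damaged.append(index)
    return bytes(out), damaged


# Usage: one flipped bit in block 2 costs only that block.
picture = bytes(range(64))
blocks = encode_blocks(picture)
bad = bytearray(blocks[2][1])
bad[0] ^= 0x01
blocks[2] = (blocks[2][0], bytes(bad))
recovered, damaged = decode_blocks(blocks)
assert damaged == [2]
assert recovered[:32] == picture[:32] and recovered[48:] == picture[48:]
```

This only works if each block can be decoded without its neighbors, which is exactly where today's compressed formats tend to fall down. A fuller design would attach an adjustable amount of ECC to each block and fall back to the placeholder only when correction fails.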
The way I see it, file and data integrity have to change or we will eventually lose all of our long-term archived data. The potential cost to industry and governments around the world — and the threat of the loss of our shared history — mean that we have to do something. Everything is going digital, from medical records to old movies, photos and documents. The current methods may have worked in the past, but they won't work in the future.
Henry Newman, CTO of Instrumental Inc. and a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years of experience in high-performance computing and storage.