When Bits Go Bad
Data protection and backup get a lot of attention, and rightfully so, but silent data corruption, or "bit rot," can wreak as much if not more havoc and be tougher to correct (see My Totally Excellent Data Corruption Adventure — Not! and Keeping Silent About Silent Data Corruption).
One of the best and most recent examinations of the problem of data corruption and bit rot is a study by CERN, the world's largest particle physics laboratory, which is based in Switzerland. CERN's Peter Kelemen outlined a number of possible solutions that storage professionals should take note of.
What does all this have to do with you, the end user? Well, just a few months ago, I ran into a problem while traveling that I suspect was caused by data corruption. As a consultant, I travel a great deal and belong to a number of hotel, airline, car rental and other travel loyalty programs, which I use to book reservations and earn points.
Anyway, early one morning in August I logged onto the Web site of an unnamed travel company and booked some reservations for the following week. I tried logging in again a few hours later and my password no longer worked. I figured the Web site was down and tried again after lunch, but ran into the same problem. I called the reservation site help line and was told that they were unaware of any problems, so I called customer support and they told me they could e-mail me my password.
I received my password and found to my dismay that something had gone terribly wrong. My password ended with ()!@, which had been changed to (]!@. I started to fear that someone had gotten my password, changed it, and was using my credit cards (or even worse, my points). I called the company again and asked for second line Internet support. I asked them when my password had last been changed. They told me a year earlier, which meant identity theft wasn't the problem.
A Question of Character
I decided to investigate the matter further, in part out of professional curiosity. I figured it had to do with the character set. I assumed the system was not an IBM mainframe and was therefore using ASCII (American Standard Code for Information Interchange), not EBCDIC (Extended Binary Coded Decimal Interchange Code). The first thing I did was go to an ASCII character conversion table. What were these two characters, ) and ], when converted to various numeric formats? In ASCII, ) is decimal 41, hex 29, binary 0101001, and ] is decimal 93, hex 5D, binary 1011101.
I have suspected for a few years that data can get silently corrupted, having seen at least three unexplainable corruptions in large environments. The set of slides from CERN confirmed my fears. Most of the disk drives used at CERN are SATA drives, from what I've learned. Was this the cause of my corrupted password? Clearly, comparing the 7-bit ASCII codes for ) and ], four of the seven bits had been flipped.
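The bit comparison can be reproduced in a few lines of Python. This is only a sketch of the arithmetic: the helper name flipped_bits is made up for illustration, and it assumes, as the column does, that the password was stored as 7-bit ASCII.

```python
def flipped_bits(a: str, b: str) -> int:
    """Count the bit positions at which two single characters differ."""
    # XOR leaves a 1 wherever the two codes disagree; count those 1s.
    return bin(ord(a) ^ ord(b)).count("1")

original, corrupted = ")", "]"

# Show each character in the numeric formats an ASCII table would list.
for label, ch in (("original", original), ("corrupted", corrupted)):
    code = ord(ch)
    print(f"{label:9} {ch!r}  dec={code:3d}  hex=0x{code:02X}  bin={code:07b}")

# The XOR mask marks exactly which of the seven bits changed.
diff = ord(original) ^ ord(corrupted)
print(f"XOR mask  bin={diff:07b}  flipped bits={flipped_bits(original, corrupted)}")
```

The XOR mask comes out as 1110100, that is, four differing bits out of seven.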
At that point, I figured I needed someone in third line support at the company. It was late in the day, but I got through first line support in a breeze when I started to talk about ASCII encoding and bits flipping, and within minutes I was on to a second line support person, who understood what I was saying but thought I was nuts: Why should I care if they had corrupted my password when the problem was fixed? It took some persistence, but I finally got a third line support person. I explained that I was a storage consultant and outlined the problem as I saw it, which turned out to be interesting timing: he had just received a call from someone else, also a computer consultant, who had had the same problem, except this person's password had been changed to a different character. Two people with the same problem on the same day.
I suggested to the third line support person that they might want to check the disk channels and the hard drives associated with the passwords, and I asked if they would mind e-mailing me the results and telling me whether the drives were SATA or Fibre Channel. Naturally they agreed, and of course I never heard back, other than a routine customer service survey. It would have been great to find out the real cause of the problem, but companies are understandably reluctant to release such information.
I will never know the real cause of my password corruption, why the other person's password was also corrupted, or just how widespread the problem was. As I said after my home PC data corruption, I believe that data can become corrupted, and there is currently limited protection in the data path against such corruption.
There is a new standard being implemented from the T10 group called Data Integrity Field (DIF) (see Storage Vendors Pledge Data Integrity), which passes checksum information along with each block of data from the SCSI driver (and potentially from the application) to the disk. This effort and Sun's ZFS file system seem to be about it for data corruption efforts, but neither of these technologies comes without some baggage. There is a limited understanding of these types of corruption, and tracking these problems down is just plain hard work. When you are having this kind of problem, there is a lot of pressure to find and fix it immediately, but often you just start replacing parts and never really find out what was broken, why, and how the corruption really happened.
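The general idea behind carrying a checksum with the data can be sketched in Python. This is not the T10 DIF format or ZFS's on-disk layout: as a stand-in, the sketch appends a CRC-32 from Python's standard zlib module to each block, and the function names protect and verify, as well as the sample data, are hypothetical.

```python
import zlib

def protect(block: bytes) -> bytes:
    """Append a 4-byte CRC-32 to a data block before it is written."""
    return block + zlib.crc32(block).to_bytes(4, "big")

def verify(stored: bytes) -> bytes:
    """Return the data if its CRC-32 still matches, else raise ValueError."""
    block, tag = stored[:-4], stored[-4:]
    if zlib.crc32(block).to_bytes(4, "big") != tag:
        raise ValueError("checksum mismatch: block was corrupted in the data path")
    return block

# Hypothetical record, written with its checksum attached.
stored = protect(b"password=secret()!@")
assert verify(stored) == b"password=secret()!@"

# Simulate the corruption from the story: ')' is 0x29 and ']' is 0x5D,
# so XOR-ing the byte with 0x74 flips four bits and turns one into the other.
damaged = bytearray(stored)
idx = stored.index(b")")
damaged[idx] ^= 0x74

try:
    verify(bytes(damaged))
except ValueError as err:
    print(err)  # the reader learns the block is bad instead of trusting it
```

A check like this only detects the corruption; recovering the original data still requires a redundant copy or an error-correcting code, which is part of the performance and cost trade-off.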
I am more convinced than ever that data corruptions are going to happen and there is nothing we can really do about it with current technologies. With the explosive growth of data and the global reach of data networks, we need new robust error encoding throughout the entire data path, from iPods to my favorite travel site. But the question is, are we willing to pay the price? Error encoding will reduce performance and increase cost. I am ready to pay the price for increased reliability. Are you?
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 27 years of experience in high-performance computing and storage.