The assignment seemed easy at first: find out whether bit error bugs (known as silent corruption) are a serious problem, and then find someone who’s had trouble because of them.
Unfortunately, anyone who has suffered from this kind of failure isn’t likely to step forward. That may be a function of the environments most likely to be affected: financial, healthcare and other high-reliability, very high-throughput operations that move and store enough data to get hit by an occasional error.
A PowerPoint on the issue by Emulex Corp. reveals a few incidents:
- A well-known e-commerce company was forced to shut down for days when a bug in the file system manager caused bad data to be written onto the database.
- A large telecom company had its database corrupted when a faulty disk adapter wrote garbled data onto the database.
- A leading financial services company experienced repeated corruptions when a problem in the virtual memory system caused the wrong data to be written onto the database.
- A large manufacturer incurred repeated data corruptions over a period of many months that were attributed to a faulty interrupt handler.
Such failures are sometimes reported vaguely in investor announcements as “system failures,” but fingers are rarely pointed in public. Yet they clearly do happen. Why else would Oracle, Emulex and others in the storage industry be investing in technology to limit customer exposure to data corruption?
Stopping a Silent Data Killer
Many people confuse the issue with “bit rot” — i.e., leave a hard drive on a shelf or a computer unused for a year or two and some of the stored bits can flip, causing corruption and failures. Silent corruption, on the other hand, is about errors in the data being sent to the drive. The term “silent” comes from the fact that such errors may not be noticed for months — everything seems fine until it comes time to access that block or back up the data.
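The basic defense against silent corruption is the same one the industry efforts described below build on: compute a checksum when data is written, and re-verify it later when the data is read back or scrubbed. A minimal sketch in Python — the hash choice, function names and in-memory “storage” dict are illustrative, not any vendor’s implementation:

```python
import hashlib

def checksum(block: bytes) -> str:
    """Compute a digest for a block of data at write time."""
    return hashlib.sha256(block).hexdigest()

def write_block(storage: dict, digests: dict, key: str, data: bytes) -> None:
    """Store the block together with its checksum."""
    storage[key] = data
    digests[key] = checksum(data)

def scrub(storage: dict, digests: dict) -> list:
    """Re-read every block and report any whose checksum no longer matches."""
    return [key for key, data in storage.items()
            if checksum(data) != digests[key]]

# Example: a bit changes somewhere between write and read.
storage, digests = {}, {}
write_block(storage, digests, "block-42", b"important payroll record")
storage["block-42"] = b"important payroll recorc"  # simulated silent corruption
print(scrub(storage, digests))  # the scrub catches it: ['block-42']
```

Without the scrub step, the corrupted block would sit undetected — and get faithfully backed up — until something finally tried to use it.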
Fibre Channel (FC) and SATA, of course, utilize data checking techniques and correction algorithms to handle errors. But many of these schemes were developed decades ago, when channel speeds were less than a tenth of what they are today — around 25 Mbps compared to over 400 Mbps these days. Further, there are far more channels than anyone envisioned back then. The result is that the coding isn’t robust enough to handle modern-day circumstances.
What are the implications? They can range from losing a file to the entire file system or database becoming corrupted. A couple of scientists have gotten in on the act, alerting the world to the issue. Computer scientist Vijayan Prabhakar proposed a workaround in a paper called “Iron File Systems.” Peter Klemenen, an IT specialist at CERN, the European Organization for Nuclear Research, is also working on the problem for SATA disks.
“Silent corruptions are a fact of life,” Klemenen said. “The first step toward a solution is detection, though elimination seems impossible.”
Accordingly, the vendor community has been working on the T10 DIF (Data Integrity Field) standard for complete end-to-end data integrity checking for enterprise storage systems (see Storage Vendors Pledge Data Integrity). However, this only works with FC and SAS — and it’s SATA that has the highest error rates.
Its scope, therefore, has been expanded by the Data Integrity Initiative (DII). It is the brainchild of Oracle, Emulex, LSI Corp. and Seagate Technology.
“DII’s focus is a class of data corruption unprotected by checksums used in communication networks,” said Williams. “This class of data corruption derives mostly from software or firmware errors. But hardware faults, as well as human-induced errors, can also be a cause for data corruption. The approach DII advocates is an end-to-end approach and is an extension of the T10 DIF standard.”
Williams admits that while data corruption occurrences may be rare, the impact can be devastating. DII, then, is really a combination of T10 DIF with an earlier Oracle initiative known as HARD — Hardware Assisted Resilient Data. DIF operated from the HBA down to the disk drive, whereas HARD went from the HBA up to the application. The work on DII is intended to cover the full application-to-spindle gamut.
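The T10 DIF field itself is an 8-byte tuple appended to each 512-byte sector: a 2-byte guard tag (a CRC-16 of the sector data, using the standard’s 0x8BB7 polynomial), a 2-byte application tag, and a 4-byte reference tag that typically carries the low 32 bits of the target LBA. A simplified Python sketch of how each hop along the path can re-verify the tuple — the helper names are ours, and real implementations live in HBA firmware and drive electronics:

```python
import struct

def crc16_t10dif(data: bytes) -> int:
    """Bitwise reference CRC-16 using the T10 DIF polynomial (0x8BB7)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def make_dif(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte integrity field: guard tag, app tag, reference tag."""
    assert len(sector) == 512
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)

def verify_dif(sector: bytes, dif: bytes, lba: int) -> bool:
    """Recompute the guard and check the reference tag against the LBA this
    hop believes it is handling; a misdirected write fails the second check."""
    guard, _app, ref = struct.unpack(">HHI", dif)
    return guard == crc16_t10dif(sector) and ref == (lba & 0xFFFFFFFF)

sector = bytes(512)
dif = make_dif(sector, lba=1000)
print(verify_dif(sector, dif, lba=1000))  # True
print(verify_dif(sector, dif, lba=1001))  # False: write aimed at the wrong block
```

Because every layer from the application down can recompute the guard and compare the reference tag, a corruption introduced at any hop is caught before the bad data lands on the platter.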
“Our initial efforts are in the Linux domain because of its strategic nature to Oracle,” said Williams. “Its open nature also allows Oracle to innovate in this area in ways that would not be as easy with another operating system.”
Richard Vanderbilt, a senior alliance manager at LSI, expects Oracle to be first out of the gate with a T10 DIF product — T10 DIF is being embedded in the Oracle Storage Manager (OSM) file system for Linux. Emulex, he said, should also be releasing an FC HBA soon, while LSI will initially add the technology into its high-end arrays by the end of 2008.
“DII will minimize silent data corruption that might otherwise go undetected for months,” said Vanderbilt. “That prevents problems like backing up faulty data for months and only finding out when you go to do a restore that it is corrupted.”
This tends to happen between devices and storage subsystems. Data might be reported as having been written to one block yet is written to the adjacent block, an end user error might result in an index field being overwritten, or a kernel memory caching issue might mean that certain bits are not written when they should have been.
“T10 DIF is not a silver bullet, but it will solve most silent data corruption issues,” said Vanderbilt. “By validating the data at every point, we can minimize silent corruption.”
SNIA Steps In
To achieve its aims, Oracle has to open up some code to the open source community and also gain the backing of vendors such as Red Hat and Novell. Meanwhile, LSI is confident of bringing Sun, Microsoft and IBM into the fold to encompass Windows, AIX and Solaris.
Accordingly, the Storage Networking Industry Association (SNIA) has been approached and has approved a DII task force to bring the initiative outside the scope of the four vendors who originated it.
“Currently, the four companies are moving to open up this effort through SNIA and bring in other companies,” said Williams. “It is also hoped that through SNIA we can broaden the solution to other platforms and applications.”
But SNIA’s involvement doesn’t mean that everyone should panic about the issue of data integrity. The chances of winning the lottery are something like 60 million to one, depending on the numbers involved (there are seven zeroes in that number). Without any kind of DIF, HARD or DII initiative, the chances of failure go into the quadrillions. Reed-Solomon encoding and DIF push it to somewhere around 1 in 10^17 or beyond. So we are not talking about an everyday occurrence.
But high-volume, high-availability systems are still exposed to such tiny probabilities. In a large enterprise operating with sustained transfer rates of 100 GB/sec, that adds up to almost 300 errors in a year. And for a heavy-duty network running at 1 TB/sec, it adds up to nearly 3,000 errors.
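Those figures hold up as back-of-the-envelope arithmetic, assuming the 1-in-10^17 undetected bit error rate cited above:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def expected_errors(rate_bytes_per_sec: float, bit_error_rate: float) -> float:
    """Expected undetected errors per year at a sustained transfer rate."""
    bits_per_year = rate_bytes_per_sec * SECONDS_PER_YEAR * 8
    return bits_per_year * bit_error_rate

# Assumes a 1-in-10^17 undetected bit error rate, per the estimate above.
print(round(expected_errors(100e9, 1e-17)))  # 252 errors/year at 100 GB/sec
print(round(expected_errors(1e12, 1e-17)))   # 2523 errors/year at 1 TB/sec
```

That is roughly the “almost 300” and “nearly 3,000” errors per year the article cites.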
But Marty Czekalski, an interface and emerging architecture programs manager at Seagate who attends T10 meetings, believes it isn’t much of a problem in any case.
“Modern FC drives are great at finding these issues, and SATA drives are rapidly getting to that point too,” he said. “So undetected data corruption in the disk drive is almost unheard of, perhaps as high as one in 10^30.”
DIF, he said, takes care of items outside of the disk drive such as at the switch, RAID controller, HBA or virtualization in the fabric.
“DIF’s biggest advantage is that it catches rare errors before they are written to disk drives,” said Czekalski. “Older methods only caught errors after the fact of corruption.”
Seagate plans to incorporate DIF in its next-generation FC drive by the end of this year. SATA drives will follow later.