Gibson Discusses Learning From Storage Failures
Last October, EnterpriseStorageForum.com spoke with Garth Gibson, the Carnegie Mellon University computer scientist who pioneered RAID technology and leads an initiative called the Petascale Data Storage Institute (PDSI), about the challenges of storing very large amounts of data. Chief among the challenges the PDSI researchers were exploring was why computers fail.
Now, a year later, we checked in with Gibson to see if PDSI is any closer to discovering why computers fail and to learn the latest developments in petascale storage.
Setting a New Standard
According to Gibson, who founded and serves as CTO of Panasas, a leading player in the large-scale storage space, "one of the most important things happening in storage in 2007 for large-scale users in both the enterprise and in scientific computing is the completion of the draft standard of Parallel NFS," which he said he expects to be delivered to the Internet Engineering Task Force at its December 2 meeting. "That will create a multi-source, competitive, standardized file system that can scale to the requirements of petascale systems," Gibson said. He also noted that until now there hasn't been an open standard for scalable file systems. "So it creates new opportunities for solutions."
The standard is the next generation of NFS, known as NFS 4.1. As its name implies, it was designed to replace release 4.0, "yet everything that's in 4.1 is optional," Gibson said, "so you can continue using [NFS 4.0] just like you've been using it and start to experiment with the new features."
For enterprises that require a high-performance, scalable storage system but are nervous about investing hundreds of thousands of dollars in proprietary systems that constantly need to be upgraded, Gibson said he believes the new open standard will ultimately yield a better return on investment.
Compensating for Media Faults
As for discovering why computers fail, and constructing large-scale storage systems with lower failure rates that don't compromise on speed yet aren't prohibitively expensive, disk drive makers continue to make improvements as researchers continue to make inroads.
"Companies are recognizing scale and they are doing things to improve the tolerance of their storage systems to more and more failures," Gibson said. Some of those things include faster repairing systems, massive parallel reconstruction of data, additional use of checkpointing and integrity codes and error correcting codes to protect against more types of failures, and double and triple disk failure-tolerant RAID. While Gibson is cautious not to describe any one of these as more important, he does point to an overall trend (albeit one he says is a couple of years old now) to provide more powerful correction mechanisms.
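The simplest of the protection mechanisms Gibson mentions, single-parity RAID, stores the bitwise XOR of the data blocks so that any one lost disk can be rebuilt from the survivors. A minimal sketch (illustrative only; real arrays work at the sector level, and the double- and triple-failure-tolerant schemes Gibson refers to require stronger Reed-Solomon-style codes):

```python
from functools import reduce

def parity(blocks):
    """XOR parity across equal-length data blocks (single-parity RAID)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild(surviving_blocks, parity_block):
    """Recover the single missing block: XOR parity with all survivors."""
    return parity(surviving_blocks + [parity_block])

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
p = parity(data)

# Simulate losing the second disk and reconstructing it.
recovered = rebuild([data[0], data[2]], p)
assert recovered == data[1]
```

Because XOR parity can absorb only one missing block per stripe, a second failure during the rebuild (a failed drive or an unreadable sector) defeats it, which is exactly the scenario Gibson describes below.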
What is important and worth noting, Gibson said, is that drives in general are getting more reliable. The problem is, "the number of drives we're using in our systems, even more reliable drives, and the amount of data that we're putting on them and taking off them has grown so dramatically in the last 10 years, [and that's] adding up to more overall failures."
Of particular concern to enterprises and researchers is what's known as the media failure rate, also known as uncorrectable read errors or latent media faults, where one one-billionth of the surface of a disk becomes unreadable. While the problem does not occur that often (Gibson said the rate works out to roughly one media fault for every 10 to 100 terabytes read), it does occur from time to time, particularly when you're dealing with petascale systems, and can cause significant problems.
For example, Gibson describes this scenario:
"Let's say you had 14 disks in a RAID and one of them has failed. You now need to read the entire contents of 13 disks. And the drives could be a terabyte. That means you have to read 13 terabytes of data in that reconstruction. The probability of running into a media fault is quoted at somewhere between 10 and 100 terabytes read. So what that means is that during a reconstruction for the lower-quality drives, you pretty much expect not to be able to read the entire thing. There will be at least one sector you can't read. And even for the higher quality drives, one in 10 reconstructions is going to have this problem.
"If you are unable to read a sector during a reconstruction, even though you've only lost one one-billionth of the data, you've still failed the reconstruction. And file systems today have no actions you can take when you've failed a reconstruction. They simply go offline, and you then have to call the vendor [whose] technicians then have to try to figure out which sector it was and what they can do to try to repair it."
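Gibson's back-of-the-envelope numbers can be checked directly. If media faults are modeled as arriving at random (a Poisson process) at a rate of one fault per 10 to 100 TB read, the chance of hitting at least one during a 13 TB rebuild matches his estimates; the model and figures here are a sketch, not from the article:

```python
import math

def p_reconstruction_fault(tb_to_read: float, tb_per_fault: float) -> float:
    """Probability of at least one unreadable sector while reading
    tb_to_read terabytes, assuming faults occur as a Poisson process
    at a rate of one per tb_per_fault terabytes read."""
    return 1.0 - math.exp(-tb_to_read / tb_per_fault)

tb_read = 13.0  # 13 surviving 1 TB drives, each read in full

p_low_quality = p_reconstruction_fault(tb_read, 10.0)    # ~0.73
p_high_quality = p_reconstruction_fault(tb_read, 100.0)  # ~0.12
```

At one fault per 10 TB the rebuild is more likely to fail than succeed; even at one per 100 TB, roughly one reconstruction in eight hits an unreadable sector, consistent with Gibson's "one in 10 reconstructions."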
So why can't vendors further reduce, eliminate or compensate for latent media faults? The problem, Gibson said, is that "the marketplace expects more information for the same dollars." But to do that, manufacturers "have to pack the bits more closely together. And if they didn't have to return you the correct data, they could do that very fast. So the rate at which they fail to return the correct data is one of the limits that they face in how quickly they can increase the capacity and how well they can provide you with more data for the same dollars."
Vendors could tighten down the bits a bit more, "but if they did," Gibson said, "what would happen is the rate that they would fail to read data would go up. So they're tightening them down only to the point where that rate is just acceptable."
As a result, large-scale storage vendors, including Panasas (which is scheduled to make a big announcement in this area later this fall), are inventing new protection mechanisms to isolate the damage in the event of a media fault, which should be big and welcome news to enterprises. (To make an analogy, imagine those strands of twinkling lights you drape on a Christmas tree. Typically, at least in the past, when one little light failed, the whole strand went dark, until you found and replaced the burnt out bulb. Now imagine a strand where even if one little light fails, the rest keep twinkling. That is, in essence, what storage vendors are now attempting to do, albeit on a much, much larger scale.)
Learning From Failures
Another big development in the world of petascale data storage is the establishment of the Computer Failure Data Repository, where end users (so far just government-funded supercomputer sites) have been releasing their failure data for others to examine and learn from. That information is important, Gibson said, "because the right way to improve the quality of systems is to really understand how they fail and target their actual failure mode."
Although computers have been around for a long time, "most computer scientists have had very limited knowledge [and] direct data on the failure mechanisms that occur in large collections of not just storage but computers," Gibson said.
Because vendors are often loath or unable to share failure data, the Computer Failure Data Repository is courting and counting on end users, such as Los Alamos National Laboratory, Pacific Northwest National Laboratory, Lawrence Berkeley Lab and the National Energy Research Scientific Computing Center (NERSC), to make that information available. Los Alamos has already contributed a list of all of the failures the lab's 23 different clusters have experienced over nine years, which should prove helpful to researchers.
The hope, Gibson said, is that by uncovering why and how large-scale (and even smaller) computing systems or clusters fail, vendors can develop technologies that reduce or compensate for failures, allowing them to build bigger, faster, more efficient and reliable computers and storage systems at prices the market will bear.
Jennifer Lonoff Schiff is a regular contributor to EnterpriseStorageForum.com.