Preparing for Failure

Around the globe, scientists are using blindingly fast, incredibly powerful supercomputers to model and predict important environmental events, such as global warming, the paths of hurricanes and the motion of earthquakes, and to envision more fuel-efficient cars and planes, ensure nuclear stockpile safety and develop new sources of energy. All of these simulations require the processing, sharing and analyzing of terabytes or petabytes of data. But what about storing and managing all that data?

To develop large-scale, high-performance storage solutions that address the challenges posed by the huge amounts of data that supercomputer simulations use and produce, the U.S. Department of Energy (DOE) recently awarded a five-year, $11 million grant to researchers at three universities and five national laboratories under the banner of the newly created Petascale Data Storage Institute (PDSI).

Part of the DOE’s larger Scientific Discovery Through Advanced Computing (SciDAC) project, PDSI’s five-year mission is to explore the strange new world of large-scale, high-performance storage; to seek out data on why computers fail and new ways of safely and reliably storing petabytes of data; to boldly go where no scientist or scientific institution has gone before — and to share its findings with the larger scientific (and enterprise) community.

Garth Gibson, founder and CTO of Panasas and the Carnegie Mellon and Berkeley computer scientist who pioneered RAID technology, is leading the PDSI effort.

“The first and primary goal of creating the Petascale Data Storage Institute was to bring together a body of experts covering a range of solutions and experiences and approaches to large-scale scientific computing and storage and make their findings available to the larger scientific community,” states Gibson.

The project’s second goal is to standardize best practices, says Gibson.

Performance Comes at a Cost

PDSI’s third goal: to collect data on computer failure rates and application behaviors in order to create more reliable, scalable storage solutions.

“As computers get a thousand times faster, the ability to read and write memory — storage — has to get a thousand times faster,” explains Gibson.

But there’s a downside to greater performance.

“As we build bigger and bigger computer systems based on clustering, we have an increased rate of failures. And there is not enough publicly known about the way computers fail,” he says. “They all fail, but it’s very difficult to find out how any given computer failed, what is the root cause.”

Today’s supercomputers fail once or twice a day. Once systems are built to scale out to multiple petaflops (a petaflop is a quadrillion calculations per second), that rate could climb to one failure every few minutes, creating a serious problem. As PDSI scientist Gary Grider said in a recent interview: “Imagine failures every minute or two in your PC and you’ll have an idea of how a high-performance computer might be crippled.”
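The arithmetic behind that jump can be sketched in a few lines of Python. This is a rough illustration, not a PDSI model: it assumes every node fails independently at a constant rate, so the mean time between failures (MTBF) of the whole system is roughly the per-node MTBF divided by the node count. The five-year node MTBF and both node counts are hypothetical figures chosen only to land in the ranges quoted above.

```python
HOURS_PER_DAY = 24.0

# Hypothetical per-node MTBF: five years, expressed in hours.
ASSUMED_NODE_MTBF_HOURS = 5 * 365 * 24


def system_mtbf_hours(node_mtbf_hours: float, node_count: int) -> float:
    """Approximate whole-system MTBF in hours.

    Assumes independent nodes with constant failure rates, so the
    system failure rate is the sum of the per-node rates.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


def failures_per_day(node_mtbf_hours: float, node_count: int) -> float:
    """Expected number of failures per day across the whole system."""
    return HOURS_PER_DAY / system_mtbf_hours(node_mtbf_hours, node_count)


if __name__ == "__main__":
    # Hypothetical cluster sizes: a present-day system and a much larger one.
    for node_count in (2_500, 500_000):
        mtbf = system_mtbf_hours(ASSUMED_NODE_MTBF_HOURS, node_count)
        rate = failures_per_day(ASSUMED_NODE_MTBF_HOURS, node_count)
        print(f"{node_count:>7} nodes: one failure every "
              f"{mtbf * 60:.1f} minutes ({rate:.1f} per day)")
```

Under these assumptions, the smaller system sees a failure roughly every 17.5 hours and the larger one roughly every five minutes. Real clusters have correlated failures and uneven component reliability, which is part of why the failure data PDSI is gathering matters.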

To learn more, PDSI scientists are busy analyzing the logs of thousands of computers to determine why computers fail, so they can come up with new fault-tolerance strategies and petascale data storage system designs that can tolerate many failures while still operating reliably.

As it makes new discoveries, PDSI will release its findings through educational and training materials and tutorials it plans to develop. PDSI will also hold at least one workshop a year, including one next month at SC06.

While the Institute’s findings will initially benefit the scientific supercomputing community, Gibson sees a trickle-down effect that will eventually reach enterprises.

“There is a whole commercial ecosystem around this,” says Gibson. “The same technology that is being driven first and foremost in the DOE labs [and now PDSI] shows up in energy research, the oil and gas [industries], in seismic analysis. … It shows up in Monte Carlo financial simulations for portfolio health. It shows up in the design of vehicles and planes. … It’s the same technology that’s used in bioinformatics for searching for proteins in genes. It’s almost the same technology that’s used for rendering computer graphics.”

So by the same token, the best practices, standards and solutions pioneered by PDSI in large-scale storage should eventually make their way into applications for the commercial sector. Already, IBM, HP, Sun and Cray (and no doubt other vendors) are busy working on solutions that address the challenges of large-scale storage. And as the scientists at PDSI uncover the reasons why computers fail and come up with new fault-tolerance strategies, vendors will be able to use that information to design storage solutions that can scale out even more while still providing the reliability that enterprises and institutions need and expect.
