The Wayback Machine: From Petabytes to PetaBoxes


The Internet Archive (www.archive.org) was created to build an Internet library that would give researchers, historians, and scholars access to digital collections and to information gathered from millions of Web sites. Since its founding in 1996, the nonprofit organization has archived over 65 billion pages from 50 million Web sites worldwide, including text, audio, and moving images. It also hosts and stores a number of digital collections, including a couple belonging to the Library of Congress.

Storing and protecting all of that digitized data — some two petabytes (or two million gigabytes) worth, compressed — from damage, degradation or destruction is critical to the Internet Archive’s mission. Additionally, because the Archive does not get rid of any data, it must constantly add storage and anticipate its long-term preservation needs.

Start with Commodity Hardware…

Over its 10-year history, the Internet Archive’s storage infrastructure has continually evolved. “We’re probably on the fourth generation of systems,” says John Berry, the Archive’s vice president of operations.

The Archive’s current storage architecture is a distributed system. As Berry explains, “you couldn’t fit all this on one machine. The Wayback Machine alone is about a petabyte of compressed data. So you’re kind of stuck using many machines. You also get some nice robustness by having a large number of machines.”

The Wayback Machine is a service that allows people to visit archived versions of Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web.
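
To make that lookup concrete, here is a minimal sketch of the same kind of query done programmatically, using the Wayback Machine's public availability API (a modern endpoint that is not described in this article); the target URL and date are placeholders.

```python
# Minimal sketch of a Wayback Machine lookup done programmatically, via the
# public availability API (a modern endpoint, not described in this article).
# The target URL and timestamp below are placeholders.
import json
import urllib.request

target = "example.com"
timestamp = "20060615"  # find the snapshot closest to this YYYYMMDD date

query = (
    "https://archive.org/wayback/available"
    f"?url={target}&timestamp={timestamp}"
)
with urllib.request.urlopen(query) as resp:
    data = json.load(resp)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print("Archived copy:", closest["url"], "captured", closest["timestamp"])
else:
    print("No archived snapshot found for", target)
```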

“When you have a lot of computers like we do and a lot of disks like we do, there’s always something that’s breaking,” says Berry. “So you want to have a system that’s resilient and allows services to operate in the face of degraded hardware. So we really didn’t have a choice about having many machines. Our approach is to use fairly low cost commodity-type hardware, so that we can scale very large at low cost.”

Throw in a PetaBox…

The Archive also makes use of a relatively new storage technology called a PetaBox, built by Capricorn Technologies (www.capricorn-tech.com).

“We wanted to have very large amounts of storage in the smallest space and using the least energy possible,” explains Berry.

So Capricorn developed the PetaBox (www.petabox.com), a high-density, low-cost, low-power, scalable mass storage solution (actually a family of products), specifically for, and in collaboration with, the Archive.

“The PetaBox is a software system as much as it’s a physical entity,” explains Berry. “It will scale out to thousands of machines. And roughly, with the kinds of storage machines we use, you can fit a petabyte in 500 machines, give or take, depending on which disks you put into them. So anywhere in the 500 to 1,000 machine range you can get a petabyte in a PetaBox. Right now we have between 2,000 and 3,000 machines, organized into clusters. [A cluster includes a computer farm, catalog, monitor and storage/PetaBox.] They’re all managed as one entity. And that’s really the essence of what the PetaBox is: It allows us to manage 2,000 machines as pretty much one entity.”
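
The PetaBox management layer itself isn't detailed in the article, but the "manage thousands of machines as one entity" idea can be sketched generically: keep an inventory of storage nodes and fan commands out to them in parallel. The hostnames and the disk-usage command below are hypothetical, not the Archive's actual tooling.

```python
# Generic sketch of managing many storage nodes as one entity: fan a
# disk-usage check out to an inventory of nodes in parallel over SSH.
# Hostnames are hypothetical; real tooling would handle auth and retries.
import subprocess
from concurrent.futures import ThreadPoolExecutor

NODES = [f"node{i:04d}.cluster.example" for i in range(1, 2001)]

def disk_usage(host: str) -> str:
    try:
        result = subprocess.run(
            ["ssh", host, "df", "-h", "/"],
            capture_output=True, text=True, timeout=30,
        )
        return f"{host}: {result.stdout.strip() or result.stderr.strip()}"
    except subprocess.TimeoutExpired:
        return f"{host}: timed out"

with ThreadPoolExecutor(max_workers=50) as pool:
    # Sample a handful of nodes; a full sweep would map over all of NODES.
    for report in pool.map(disk_usage, NODES[:5]):
        print(report)
```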

The PetaBox system has dramatically reduced the Archive's disk failure rates, and it is helping the Archive keep power and administrative costs low. Each rack holds between 80 and 100 terabytes of data, spread across nodes with approximately four disks each, and uses only 6 kW. And each petabyte in the system requires only one system administrator.
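
A quick back-of-the-envelope check of those figures (using the 80 to 100 terabytes and roughly 6 kW per rack quoted above, in decimal units) shows what a petabyte costs in racks and power:

```python
# Rough arithmetic on the rack figures quoted above: racks and power
# per petabyte at 80-100 TB and ~6 kW per rack (decimal units).
PB_IN_TB = 1000
KW_PER_RACK = 6

for tb_per_rack in (80, 100):
    racks = PB_IN_TB / tb_per_rack
    print(f"{tb_per_rack} TB/rack -> {racks:.1f} racks/PB, "
          f"~{racks * KW_PER_RACK:.0f} kW/PB")
# 80 TB/rack -> 12.5 racks and ~75 kW per petabyte;
# 100 TB/rack -> 10 racks and 60 kW per petabyte.
```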

Add Open Source Software…

As for the software running the system, it’s almost all open source. “Primarily now we’re using the Ubuntu (www.ubuntu.com) release of Debian (www.debian.org) for our OS,” says Berry. “It’s very easy to manage and install. We also use Linux, which we’ve used for many years in different flavors. And we use Apache and things like Perl and PHP.”

For the Archive, the decision to go with open source software was based on cost savings as well as experience.

“Obviously you don’t pay the big licensing fees,” says Berry. “But it also gives us a lot of openness and freedom, and the Archive is usually pushing some technical edge. So it’s nice to have that flexibility, which we wouldn’t necessarily have had with vendor software.”

And Archive for the Future

As Berry explains, when your goal is preservation, you “constantly need more and more [storage], because you’re not getting rid of anything. We just accumulate data, which means that if we have a couple of petabytes now, we’ll have 10 petabytes in a matter of years, and 50 and 100 and so forth.”

And unlike many large for-profit corporations, the Archive places immediate access and rapid retrieval of data lower on its priority list than long-term, low-cost preservation.

“We do provide access, and try to make it as rapid as possible, but it’s not our primary objective,” explains Berry. “Our primary objective is to store and preserve data for all time.”

As part of this preservation process, the Archive built a monitoring system that constantly checks every device in the system. The Archive also makes a habit of rigorously checking the integrity of data whenever it migrates to new technologies, disks fail, or data is copied. “The sense of permanence makes you more cautious and deliberate,” notes Berry.
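
The article doesn't describe the Archive's actual tooling, but the integrity checks it alludes to boil down to hashing data before and after each copy or migration and flagging any mismatch. A minimal sketch follows; the use of SHA-256 and the file paths are illustrative assumptions.

```python
# Minimal fixity-check sketch: hash a file before and after a copy or
# migration and flag any mismatch. SHA-256 and the example paths are
# illustrative assumptions, not the Archive's actual tooling.
import hashlib
from pathlib import Path

def checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_copy(source: Path, replica: Path) -> bool:
    ok = checksum(source) == checksum(replica)
    print(f"{'OK' if ok else 'MISMATCH'}: {source} -> {replica}")
    return ok

# Example with hypothetical paths:
# verify_copy(Path("/data/item.warc.gz"), Path("/mnt/new-node/item.warc.gz"))
```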

In addition, the Internet Archive is in the process of giving copies of its main archives to the European Archive (www.europarchive.org) and to the Library of Alexandria, also known as the Bibliotheca Alexandrina (www.bibalex.org), in Alexandria, Egypt. The latter is also about to get a copy of the entire Wayback Machine. By maintaining copies of the Archive’s archives, these sites also function as disaster recovery sites.

In case you’re wondering how long it takes to copy an entire petabyte of information, Berry says about a week, with another week for verification.

“It’s not even super high-speed networking,” he says. “We use very conventional equipment. But we have a lot of experience moving things at multi-gigabit speeds. And again, because we use many different computers, individually they don’t have to work that hard.”
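
The arithmetic bears that out: moving a petabyte in roughly a week requires only low-double-digit gigabits per second in aggregate, which becomes a trivial per-machine load once spread across hundreds of nodes (the 500-node split below is illustrative).

```python
# Rough arithmetic behind "about a week to copy a petabyte": the sustained
# aggregate rate required, and the per-machine share when the transfer is
# spread over many nodes (the 500-node split is illustrative).
PETABYTE_BITS = 1e15 * 8    # 1 PB in bits, decimal units
WEEK_SECONDS = 7 * 24 * 3600

aggregate_gbps = PETABYTE_BITS / WEEK_SECONDS / 1e9
print(f"Aggregate rate: ~{aggregate_gbps:.1f} Gbit/s")  # ~13.2 Gbit/s

NODES = 500
per_node_mbps = aggregate_gbps / NODES * 1000
print(f"Per node across {NODES} machines: ~{per_node_mbps:.0f} Mbit/s")  # ~26 Mbit/s
```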

Things like copying and migration of data will get even easier this fall, as the Archive finishes moving into a single data center. Until now, the Archive had been using two different data centers in the San Francisco area, a consolidation from three not that long ago. With just one data center (still in San Francisco) handling all of the Archive’s storage needs, the system should be a lot simpler to run, says Berry, which is a good thing when you are adding dozens of terabytes of data each month to your storage system.


Jennifer Schiff
Jennifer Schiff is a business and technology writer and a contributor to Enterprise Storage Forum. She also runs Schiff & Schiff Communications, a marketing firm focused on helping organizations better interact with their customers, employees, and partners.
