Storing Decades of Bureaucratic Data
In a state-of-the-art building in Rockville, Maryland, rests our nation's bureaucratic history in electronic record form.
The National Archives and Records Administration (NARA) is in the process of sorting through millions of historically significant records created by the Federal government and its bureaus and agencies over the last thirty or more years, figuring out how to properly preserve these electronic and paper records in perpetuity while (eventually) making them available to the public.
Described as "the archive of the future," NARA's 10-year-old Electronic Records Archives (ERA) project is still a work in progress, which NARA hopes to complete in 2011. The goal of the project is "to enable researchers 50 or 100 years from now to find and retrieve electronic records using the best technology available to them, regardless of what hardware or software was used to create them," and to "make it easier for citizens to discover what records the government has and to access them," according to a statement issued by NARA.
Millions of Records, Terabytes of Data
Last summer, the ERA project achieved its first milestone, Initial Operational Capability (IOC). IOC helps archivists manage the process of determining how long agencies should keep federal records and whether those records should then be preserved in the National Archives, and it gives NARA the ability to ingest and store electronic records in the format in which they were received. The successful completion of this first phase laid the foundation for the next four parts of the project.
As a result, NARA was able to start moving approximately 3.5 million historically valuable computer files, ranging from databases about World War II soldiers to the State Department's central files on foreign affairs, into the ERA system. However, the project still has a ways to go, another two or three years, until it reaches Full Operational Capacity (FOC). Foremost among the challenges NARA currently faces is contending with those legacy electronic records while ingesting newer electronic records created by the Bush (43) Administration and completing the second increment of the project, building the ERA Search and Access System (SAS).
In many ways, NARA is waging a two-front war on electronic data, trying to tame and preserve the old while making room for the new. On one front, NARA must contend with 20TB to 40TB worth of "legacy" electronic records, computer files going back 30 years or so, many of which were stored on 3480 or 3490 data tape cartridges. To ensure that no electronic records are lost because of media degradation or obsolescence of format, NARA periodically migrates the data to more current media such as DLT-4.
On the other front, there are all the new (or newer) historically significant government-generated electronic files that need to be preserved, including the George W. Bush Administration's records (estimated at around 150TB), all of which, according to the Presidential Records Act, NARA is required to preserve in perpetuity. The Clinton Administration, by comparison, produced just 2TB of data, while the Obama Administration's reliance on technology will almost certainly result in more data than that generated by Bush 43.
"We do not know the exact number of files yet," noted Dyung Le, the director of systems engineering for the ERA project at NARA. But between the electronic records from the Bush Administration, the bulk of which is photos and e-mail, and all the other electronic records that NARA anticipates needing to store in the near future, Le anticipates the amount of data could be around 200TB. Right now, Le said, "the system can accommodate around 350TB. But we need to grow that up" to be able to store around 7PB.
Archiving and Active Storage
The problem, as with so many large-scale storage projects, has been the budget. In 2005, after awarding a $308 million six-year contract to Lockheed Martin and its team of archiving and data management experts, which included BearingPoint, Fenestra Technologies, FileTek, History Associates, EDS, Image Fortress and Science Applications International, NARA had its ERA budget drastically cut, causing Lockheed Martin to rethink the project and adjust the schedule.
"We could not afford to go out and do everything all at once," explained Le. "So we started out with a smaller system that was primarily disk-based." But the goal, he said, is to not have everything on disk, even though "it would make my life a lot simpler."
The problem with creating an all-disk system is that "the cost would be prohibitive," he said. "We do not just store one copy of the records. We need to store multiple copies [including offsite]. The Archives is very concerned about integrity. So everything has to be kept in triplicate, and everything is hash and checksum checked, and so forth."
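The integrity checking Le describes, with each record kept in triplicate and verified by hash, can be illustrated with a minimal sketch. The article does not say which hash algorithm or tooling NARA uses, so SHA-256, the copy names and the function names below are assumptions chosen only for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a record's bytes (algorithm is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def verify_copies(expected_digest: str, copies: dict[str, bytes]) -> dict[str, bool]:
    """Compare every stored copy against the digest recorded at ingest time."""
    return {name: sha256_of(blob) == expected_digest for name, blob in copies.items()}


def damaged_copies(expected_digest: str, copies: dict[str, bytes]) -> list[str]:
    """List the copies whose contents no longer match the recorded digest."""
    results = verify_copies(expected_digest, copies)
    return sorted(name for name, ok in results.items() if not ok)


# Hypothetical record held in three places, one of which has been corrupted.
record = b"State Department central file, 1975"
digest_at_ingest = sha256_of(record)
copies = {
    "primary": record,
    "secondary": record,
    "offsite": record[:-1] + b"X",  # simulated media degradation
}
print(damaged_copies(digest_at_ingest, copies))
```

With three copies, a single damaged copy can be detected by its mismatched digest and then restored from either of the two that still verify.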
Ultimately tape proved to be the most cost-effective solution, though NARA plans on having active data stored on disk. "Most likely, what you will see coming out is a tiered system, with disk-based storage for the more frequently accessed records and then tape backup behind that," he said. The problem is that it is hard to determine which records the public will want to access, "so it's a little bit hard for us right now to figure out what we need to put on disk."
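The kind of tiering Le outlines might look something like the sketch below, which places the most frequently accessed records on disk until a disk budget is exhausted and sends everything else to tape. NARA had not settled on a placement policy at the time of the interview, so the access counts, the capacity figure and every name here are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Record:
    record_id: str
    size_tb: float
    accesses_last_year: int  # hypothetical popularity measure


def assign_tiers(records: list[Record], disk_capacity_tb: float) -> dict[str, str]:
    """Greedily place the most-accessed records on disk; the rest go to tape."""
    placement: dict[str, str] = {}
    remaining = disk_capacity_tb
    # Consider the most frequently accessed records first.
    for rec in sorted(records, key=lambda r: r.accesses_last_year, reverse=True):
        if rec.accesses_last_year > 0 and rec.size_tb <= remaining:
            placement[rec.record_id] = "disk"
            remaining -= rec.size_tb
        else:
            placement[rec.record_id] = "tape"
    return placement


holdings = [
    Record("wwii-enlistment-db", size_tb=1.0, accesses_last_year=500),
    Record("presidential-photos", size_tb=2.0, accesses_last_year=10),
    Record("agency-backfile", size_tb=0.5, accesses_last_year=0),
]
print(assign_tiers(holdings, disk_capacity_tb=1.5))
```

A policy like this depends entirely on knowing access patterns in advance, which is the difficulty Le points to: until the public starts using the system, there is no usage history to rank records by.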
An Open Storage Architecture
Whatever the ultimate configuration, the system will be generic, said Le, "because the goal of the system was to be scalable, open and evolvable. We didn't want to be locked into any one vendor's particular [architecture or features]," he said. "Because of that we use a lot of Sun equipment, because Sun is very generic."
In addition to Sun Microsystems (NASDAQ: JAVA), the project is also using equipment from Cisco (NASDAQ: CSCO), NetApp (NASDAQ: NTAP), Hitachi and EMC (NYSE: EMC). NARA uses the Hitachi Content Archive Platform and a Hitachi hierarchical storage management (HSM) system that manages Sun, Hitachi and EMC hardware. EMC supplies the SAN, which Le said is used only as a transfer device and plays a minimal role in the overall architecture.
"We use a lot of things, but in a very generic way," Le said, "because we did not want to tailor our architecture to one particular vendor's approach."
Despite budgetary problems and delays, however, the project is still on track to be substantially (if not entirely) completed in 2011. The four pilot agencies (the U.S. Patent and Trademark Office, the Naval Oceanographic Office, the National Nuclear Security Administration and the Bureau of Labor Statistics) have already been trained on the ERA system, and NARA expects them to start using it some time this year.