Library of Congress Readies New Digital Archive
It's been more than a year since the Library of Congress selected Government Micro Resources (GMRI) to build a multi-tiered petascale storage system, using technology provided by Sun Microsystems, for the Library's new state-of-the-art National Audiovisual Conservation Center (NAVCC) in Culpeper, Virginia (see Sun Rises at Library of Congress).
The archive was to serve as the digital repository for the more than 1.1 million moving image items (including films, newsreels, television programs and advertising material), nearly 3 million audio items (including commercial sound recordings, radio broadcasts and early voice recordings of historical figures), and more than 2.1 million supporting documents (such as screenplays, manuscripts and photographs) belonging to the Library's Motion Picture, Broadcasting and Recorded Sound (MBRS) division, as well as any new media MBRS acquired after the archive went into production.
EnterpriseStorageForum.com recently caught up with members of the Library's Information Technology Services (ITS) team overseeing this massive archiving project to get a progress report and to find out what advice the team had for other institutions and enterprises considering a similar undertaking.
Know Your Speeds and Feeds
Although the archive, a multi-tiered, mostly tape-based solution composed of Sun Fire x64 servers and Sun storage running the Solaris 10 Operating System, has yet to go into production, the ITS team is confident it will perform as required, eventually storing up to 8 petabytes a year of rich media. That's because, said team members, the project was done right, right from the start.
"There was a lot of planning and investigation and discussion [leading up to] the RFP," said Sarah Gaymon, one of the project managers (see Storing National Treasures). As a result, the Library had a very good sense of what its requirements and expectations were for the system, and knew that whichever vendor or integrator it chose would know and understand them too.
Indeed, according to Thomas Youkel, the Library's information technology specialist on the NAVCC archive project, knowing the archive's long-term storage requirements up front was critical, and he advised other institutions and enterprises contemplating large-scale storage deployments to do the same.
"You need to know what your speeds and your feeds are," explained Youkel. "You need to know what your throughput requirements are. You need to know what your ingest rates are. You need to know how much information you're going to be pushing at your archive. Those are the key things.
"And you need to know those things ahead of time, because in order to build and test a system like we did, that's supposed to push 8 petabytes a year, we needed to know how much data that was every day [32 terabytes], what our Fibre Channel looked like, what our backbone needed to look like, and what was the minimum acceptable set of hardware that could push that data," he said.
As for the "speeds" part of the equation, "if you're writing to tape," said Youkel, which is what the Library is doing, "you need to know how fast you're writing to tape, particularly if you're dealing with 8 petabytes a year."
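The arithmetic behind those "speeds and feeds" can be sketched in a few lines of Python. This is an illustrative back-of-the-envelope calculation, not the Library's sizing method. It assumes decimal units (1 PB = 10^15 bytes) and 250 ingest days a year, an assumption chosen because it reproduces the 32-terabyte daily figure quoted above. The 24-hour write window and the 100 MB/s per-drive rate are hypothetical placeholders, not specifications of the Library's hardware.

```python
import math

# Decimal units are an assumption; the article does not say which it uses.
PETABYTE = 10**15
TERABYTE = 10**12
MEGABYTE = 10**6


def daily_ingest_bytes(yearly_bytes, ingest_days_per_year):
    """Bytes that must be ingested on each ingest day."""
    return yearly_bytes / ingest_days_per_year


def sustained_rate_mb_per_s(daily_bytes, write_hours_per_day):
    """Average write rate (MB/s) needed to land a day's data in the window."""
    return daily_bytes / (write_hours_per_day * 3600) / MEGABYTE


def drives_needed(required_mb_per_s, drive_mb_per_s):
    """Minimum number of drives writing in parallel, ignoring all overhead."""
    return math.ceil(required_mb_per_s / drive_mb_per_s)


yearly = 8 * PETABYTE
# 250 ingest days per year is an assumption, not a figure from the Library.
per_day = daily_ingest_bytes(yearly, ingest_days_per_year=250)
# A 24-hour write window is likewise hypothetical.
rate = sustained_rate_mb_per_s(per_day, write_hours_per_day=24)

print(f"{per_day / TERABYTE:.0f} TB per ingest day")
print(f"{rate:.0f} MB/s sustained")
# 100 MB/s per drive is a placeholder, not a real drive specification.
print(drives_needed(rate, drive_mb_per_s=100), "drives at 100 MB/s each")
```

Shortening the write window or allowing for verification passes and tape mounts raises the required rate, which is why the team wanted these numbers before choosing hardware.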
Testing all components is also critical, said Youkel.
"When a vendor or an integrator proposes a solution, you have to be able to test that solution, to make sure that what their specifications say will work when the environment is put together will work," he explained. "So not only are requirements a key, but a testing plan for those requirements is also key.
"We had a set of benchmarks and performance tests that the equipment needed to meet and/or exceed, and the equipment did indeed meet and exceed those performance and benchmark requirements," said Youkel, who declined to elaborate further.
In fact, the archive system has gone through quite a bit of performance testing and re-testing in the past year, having been initially assembled and tested at the Library's Madison Building on Capitol Hill last fall and winter and then tested again this summer after being painstakingly disassembled and then reassembled at the NAVCC some 60 miles south in Culpeper, Virginia. In addition to making sure that the system had been properly reassembled, the ITS team needed to make sure the archive system could communicate with the NAVCC network and communication facilities as well as with the disaster recovery site.
As of this writing, the system was still undergoing stress testing and was expected to go into production this month. After that, the front end, which Library Services and MBRS (the archive's sole user) have been designing, developing and building in parallel (with design help from Ascent Media, integration and installation help from Communications Engineering Inc. and software from the Gustman Group), will be integrated and the whole system re-tested.
Maintain, Verify and Migrate
While it has taken dozens of people to prepare the archive (at least 15 ITS employees, plus technicians and engineers from GMRI and Sun, and that's just on the back end/archive portion), once the digital archive system is fully operational, the Library anticipates it will need only three ITS staff to work on site at the NAVCC in Culpeper to maintain it. As part of the maintenance process, ITS will continually verify the data and migrate it to progressively higher-density storage as it becomes available, to ensure the lasting preservation of our nation's audiovisual history.
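The article does not say how ITS verifies the data. One common approach in digital preservation is fixity checking: record a checksum for each file at ingest, then periodically recompute it and compare. The Python sketch below illustrates that general idea only. The function names, the manifest layout and the choice of SHA-256 are assumptions for illustration, not details of the Library's system.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 digest, reading in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def failed_fixity(manifest):
    """Return the paths whose current checksum differs from the recorded one.

    `manifest` is a hypothetical dict mapping file path -> checksum
    recorded when the file was first ingested.
    """
    return [path for path, expected in manifest.items()
            if sha256_of(path) != expected]
```

The same comparison can confirm that a copy is bit-for-bit identical after migration to new media, before the old copy is retired.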