Storing the Universe One Particle at a Time
Some wonder if the massive particle physics experiment known as the Large Hadron Collider (LHC) will unveil secrets about subatomic particles or unknown dimensions. Others fear it could signal the end of the universe. But storage administrators have a completely different question: How on earth will all that data be stored?
"We have around 108 million files and 14 petabytes of data currently," said Sebastien Ponce, developments leader for the project. "That's up from 60 million files and about 7 petabytes of data in 2007."
And the experiment won't be in full swing until spring.
Built by the European Organization for Nuclear Research, known as CERN, the storage project is called CASTOR, short for CERN Advanced STORage manager.
"CASTOR is the software that will handle all the data coming from LHC," said Ponce. "This includes storing the data, redistributing them to Tier 1 centers and managing user accesses to the data. In total, CASTOR is handling a constant flow of 4-5 GB/s of data, with peak reaching 10-12 GB/s during days."
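For a sense of scale, the rates Ponce quotes can be converted into daily volumes. The short sketch below is only arithmetic on the figures in the quote. It assumes decimal units (1 TB = 1,000 GB) and a rate held steady for a full 24 hours, which the peak figures are not; the function name is illustrative and not part of CASTOR.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Data moved in one day at a sustained rate, in decimal terabytes."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000


# Rates taken from the quote: 4-5 GB/s constant, 10-12 GB/s at peak.
for rate in (4, 5, 10, 12):
    print(f"{rate} GB/s held for a day = {daily_volume_tb(rate):.1f} TB")
```

At the low end of the constant flow, 4 GB/s, that works out to roughly 346 TB in a day.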
CASTOR employs a hierarchical storage management (HSM) architecture, which includes:
- Three IBM TS3500 tape robots, one of them containing six S24 high-density frames and the other two with 16 frames (the maximum supported).
- IBM TS1120 drives with 700 GB of capacity per cartridge, as well as IBM TS1130 drives with 1 TB of capacity per cartridge.
- Four Sun StorageTek SL8500 robots, three of them with a capacity of 10,000 cartridges and one with a capacity of 6,000 cartridges.
- StorageTek T10000A drives with 500 GB of capacity. StorageTek T10000B drives with 1 TB of capacity are being run in a test environment prior to implementation.
CASTOR is now in its second iteration. The main difference between CASTOR 1 and CASTOR 2 is a complete rewrite of the disk cache layer that increased scalability.
"CASTOR 1 was limited to a few hundred thousand files in the disk cache, while CASTOR 2 is able to handle millions easily," said Ponce. "Throughput limitations from/to the disk cache have also been removed. Currently, we have demonstrated a 10GB/s constant over many days."
The main storage needs are related to the LHC and other particle accelerators. The Compass experiment, for example, has 4 PB of data stored in CASTOR, while others known as CMS and Atlas have 3.8 PB and 3.1 PB, respectively. This includes storage of raw data coming from the accelerator detectors, as well as reconstruction and analysis of data, and a variety of simulations that compare real data to existing theories. CASTOR is also open to users who want to store large files such as backups, but this accounts for only a few percent of the total data.
In terms of hierarchical storage management, CERN currently operates two stages. Disk is used for user access, with a back end of tape for mass storage.
"In the future, other hierarchies can be envisioned," said Ponce. "Perhaps we will make use of cold disks or flash memory."
Elements of CASTOR
CASTOR encompasses the following elements:
Oracle Database: Ponce calls this the heart of the system, as it hosts most of the decision-making code.
"Having such a data-centric architecture allows us to replicate all services for fault tolerance, and thus any node of the system can die with no major impact on the service," said Ponce. "We currently run Oracle 10.2.0.3 and a move to Oracle 11 is being considered."
Stager: This handles the disk cache, including migrations of data to tape, recall of data from tape, and garbage collection of the disk cache. It can handle hundreds of client file requests per second, running in parallel on two nodes for fault tolerance.
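The three duties named here (migration to tape, recall from tape, and garbage collection of the disk cache) can be illustrated with a toy model. This is a sketch under stated assumptions, not CASTOR code: the class and method names are hypothetical, files are treated as write-once, sizes are arbitrary units, and eviction is least recently used, a policy the article does not specify.

```python
from collections import OrderedDict


class ToyStager:
    """Toy two-tier store: a bounded disk cache in front of unbounded tape."""

    def __init__(self, disk_capacity: int):
        self.disk_capacity = disk_capacity
        self.disk = OrderedDict()  # name -> size, least recently used first
        self.tape = {}             # name -> size

    def disk_used(self) -> int:
        return sum(self.disk.values())

    def write(self, name: str, size: int) -> None:
        """New data lands on disk first; it has no tape copy yet."""
        self.disk[name] = size
        self.disk.move_to_end(name)

    def migrate(self) -> None:
        """Copy every disk file that lacks a tape copy to tape."""
        for name, size in self.disk.items():
            self.tape.setdefault(name, size)

    def garbage_collect(self) -> list:
        """Evict least recently used files that are already on tape
        until the cache fits within its capacity."""
        evicted = []
        for name in list(self.disk):
            if self.disk_used() <= self.disk_capacity:
                break
            if name in self.tape:  # never drop the only copy
                del self.disk[name]
                evicted.append(name)
        return evicted

    def read(self, name: str) -> str:
        """Serve from disk if cached, otherwise recall from tape."""
        if name in self.disk:
            self.disk.move_to_end(name)
            return "disk"
        if name not in self.tape:
            raise KeyError(name)
        self.disk[name] = self.tape[name]  # recall into the cache
        return "recalled from tape"


stager = ToyStager(disk_capacity=100)
stager.write("run1.raw", 60)
stager.write("run2.raw", 60)       # cache is now over capacity
stager.migrate()                   # both files get a tape copy
print(stager.garbage_collect())    # the older file is evicted
print(stager.read("run1.raw"))     # so reading it triggers a recall
```

The point of the ordering is that garbage collection only removes files that already have a tape copy, so a full cache never costs data, only a slower read later.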
Name Server: It defines a global namespace for CASTOR data, in a UNIX-like way. It includes Access Control Lists for permissions. It is run in parallel on four nodes for fault tolerance and load balancing.
Volume Manager (VMGR), Volume and Drive Queue Manager (VDQM), Remote Tape Copy suite (RTCP): These components handle, respectively, the queue of tape requests, the queue of drive requests, and the migration and recall of data to and from tape.
Disk Servers: About 800 disk servers, of various hardware types, currently provide about 8 PB of disk cache space. Disk servers are grouped into pools that can be customized for different uses and restricted to given groups of users.
Tape Systems/Tape Drives/Tape Servers: According to Ponce, this is the primary storage of CASTOR, preferred to disk for its long-term stability and its cost efficiency, but mainly for the extensive experience accumulated with such systems (decades, compared to years for disks).
Storage Resource Manager (SRM): A high-level interface defined by an international collaboration of people with interests in mass storage, intended to unify access to mass storage systems (see Storage Resource Management Working Group).
"This allows easy interaction between different mass storage solutions and allows the integration of these into the Grid, as storage elements," said Ponce.
Job Manager: It schedules access to the disk cache in order to guarantee maximum performance of the disks and avoid any overload.
"This also guarantees an optimum usage of tape drives by making sure that disks are able to receive/send the data when a tape is ready to read/write them," said Ponce.
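One simple way to picture scheduling that avoids overload is a cap on concurrent transfers per disk server, with extra requests waiting their turn. The article does not describe how the Job Manager actually decides, so the sketch below is an assumption throughout: the class name, the slot-based limit and the server names are all hypothetical.

```python
from collections import deque


class ToyJobManager:
    """Admit transfers only while the target disk server has a free slot."""

    def __init__(self, slots_per_server: dict):
        self.slots = dict(slots_per_server)        # server -> max concurrent transfers
        self.running = {s: 0 for s in self.slots}  # server -> transfers in progress
        self.pending = deque()                     # (job_id, server), arrival order

    def submit(self, job_id: str, server: str) -> list:
        """Queue a transfer and return the jobs that can start now."""
        self.pending.append((job_id, server))
        return self._dispatch()

    def finish(self, server: str) -> list:
        """Free one slot on a server and return the jobs that can start now."""
        if self.running[server] == 0:
            raise ValueError(f"no transfer running on {server}")
        self.running[server] -= 1
        return self._dispatch()

    def _dispatch(self) -> list:
        started = []
        still_waiting = deque()
        while self.pending:
            job_id, server = self.pending.popleft()
            if self.running[server] < self.slots[server]:
                self.running[server] += 1
                started.append(job_id)
            else:
                still_waiting.append((job_id, server))  # keeps arrival order
        self.pending = still_waiting
        return started


jm = ToyJobManager({"disk01": 2})
print(jm.submit("a", "disk01"))  # starts immediately
print(jm.submit("b", "disk01"))  # starts immediately, server now full
print(jm.submit("c", "disk01"))  # has to wait
print(jm.finish("disk01"))       # a slot frees up, so "c" starts
```

Holding requests back in this way keeps each server within what its disks can sustain, which is the property Ponce ties to keeping tape drives busy.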
CASTOR is free, open source software. All packages, source code and documentation are available at http://castor.web.cern.ch/castor/.