MIT on Monday announced plans for a petabyte-scale IP storage system that can be “managed by a few graduate students,” in the words of MIT Media Lab Director Frank Moss.
The system, based on Zetera’s Z-SAN technology in collaboration with Bell Micro, Marvell and Seagate, will be used for the lab’s “Human Speechome Project” to collect and analyze video and audio data to better understand early childhood cognitive development.
Associate Professor Deb Roy has spent much of the last year compiling 12 to 14 hours of video per day of his nine-month-old son in an effort to better understand early childhood learning and socialization.
The digital audio and video data, collected at a rate of 350GB per day, will be processed and analyzed using a suite of data mining tools that Roy and his team have been developing. By mid-2008, the information will be assembled into a database exceeding one petabyte for analysis by several hundred parallel processing devices.
Zetera chief marketing officer Doug Glen said storage systems the size of the MIT system aren’t unique, but claimed, “It is unprecedented to build it with the simplicity, scalability and cost of the one we are talking about today.”
Requirements of the system include reads/writes in excess of 160 gigabits/second, shared volumes in excess of several hundred terabytes, scalability from an initial 50 terabytes to capacity well in excess of a petabyte, 100 percent data redundancy, file access by computers running multiple operating systems, a fully virtualized storage fabric, and affordability via low-cost, high-capacity SATA hard drives.
Zetera senior product marketing director Jeff Greenberg said each component of the Storage over Internetworking Protocol (SoIP) system will access the system directly, with no RAID controllers to slow performance.
The system will ultimately be composed of more than 3,000 Seagate SATA drives, 300 Hammer Z-Rack storage enclosures, 100 Marvell-based 10G/GbE switches, and about 400 blade processors. It will process 700 terabytes of data during each 12-hour overnight analytical run. 150-drive stripes (aggregated virtual volumes) will be created using the native virtualization capabilities of Z-SAN. Protection against data loss will be delivered through RAID 10 mirrors (duplicate copies) of the raw video data, transform data and metadata files.
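As a rough sanity check on those figures, the sketch below works out the average throughput implied by processing 700 terabytes in a 12-hour run, and how many 150-drive stripes fit in 3,000 drives. It assumes decimal units (1 TB = 10^12 bytes) and a perfectly even data rate over the run, neither of which the announcement specifies; the function names are illustrative only and are not part of any Zetera tooling.

```python
# Back-of-envelope arithmetic on the published figures.
# Assumption: decimal units, 1 TB = 10**12 bytes (the announcement does not say).
TB = 10**12


def sustained_gbit_per_s(terabytes: float, hours: float) -> float:
    """Average rate in gigabits/second needed to move `terabytes` in `hours`."""
    bits = terabytes * TB * 8
    seconds = hours * 3600
    return bits / seconds / 1e9


def stripe_count(total_drives: int, drives_per_stripe: int) -> int:
    """Number of whole stripes that fit in the given number of drives."""
    return total_drives // drives_per_stripe


if __name__ == "__main__":
    # 700 TB per 12-hour overnight analytical run
    print(f"{sustained_gbit_per_s(700, 12):.1f} Gbit/s")  # 129.6 Gbit/s
    # 3,000 drives (the stated minimum) in 150-drive stripes
    print(stripe_count(3000, 150))  # 20
```

Under these assumptions the overnight run averages roughly 130 gigabits/second, which sits below the stated requirement of more than 160 gigabits/second and so leaves some headroom. The stripe count is a simple division that ignores how the RAID 10 mirrors are laid out across stripes.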
StorageIO founder and senior analyst Greg Schulz said the project is interesting, if a little overhyped.
“It is a cute little project, will get some press, however, it’s hardly something that has the storage vendors quaking in their boots over for now,” Schulz told Enterprise Storage Forum.
“Every few years someone does a science project like this to show how new technology and lower-cost technology can scale or change thinking,” Schulz continued. “It then typically takes a few years to productize and commercialize for turnkey business solutions even on a smaller scale to actually be delivered.”
Schulz took issue with calling the system an “array.”
“To say that a collection of nodes is an array is an unfair comparison to non-clustered solutions like those from EMC, HDS, IBM, etc.,” he said. “A more appropriate comparison would be how does this solution scale in capacity, performance, functionality and so forth when compared to peer, clustered, and grid-type solutions like those from EqualLogic, 3PAR, Isilon, Exanet, Panasas and many others.”