Web 2.0 has become quite a buzzword in the storage industry in the last year, with titans and upstarts alike pledging to develop storage systems suited for fast-growing, collaborative environments. But before we see what these storage systems will look like, let's first take a look at what type of data is suitable for them.
We'll call it Web 2.0 data for lack of a better term. It's different from traditional, transaction-based data in both nature and use. It comes in large files, typically created by a single user, and may be shared over some geographic distance. Much of Web 2.0 data is what you'd expect from the name: images, video, and e-mail archives, for example, but the category has also come to include volumes of information from surveillance camera footage, geospatial mining data, genomic sequences and financial analysis scenarios.
File-based Web 2.0 data is just as important as a company's transactional data and requires similar degrees of availability, security and protection from loss. Like traditional corporate data, Web 2.0 data is expanding, only more so.
To cope with the growth of Web 2.0 data, companies are adopting a storage technology developed by Web pioneers like Google (NASDAQ: GOOG) and Yahoo (NASDAQ: YHOO). Borrowing from high-performance grid computing, this approach to storage uses large racked clusters of compute and storage nodes made up of fairly inexpensive industry-standard servers and drives. The data is distributed and duplicated over multiple nodes, often geographically separated. The storage component is CAS or NAS, using SATA or SAS drives.
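To make "distributed and duplicated over multiple nodes" concrete, here is a minimal sketch of one way a cluster might choose where copies of a file go. It is not how Google, Yahoo or any vendor named here does it; the function name, the site and node names, and the hash-based selection are all assumptions made for illustration.

```python
import hashlib


def place_replicas(file_name, nodes_by_site, copies=2):
    """Pick one node in each of `copies` different sites for a file.

    Hypothetical helper: `nodes_by_site` maps a site name to a non-empty
    list of node names. Hashing the file name keeps the choice
    repeatable, so any node can work out where a file lives.
    """
    if copies > len(nodes_by_site):
        raise ValueError("not enough sites for the requested copies")

    digest = int(hashlib.sha256(file_name.encode("utf-8")).hexdigest(), 16)
    sites = sorted(nodes_by_site)
    start = digest % len(sites)

    placements = []
    for offset in range(copies):
        # Step to the next site for each copy so no two copies share a site.
        site = sites[(start + offset) % len(sites)]
        nodes = nodes_by_site[site]
        placements.append((site, nodes[digest % len(nodes)]))
    return placements


# Illustrative cluster layout (made-up names).
cluster = {
    "east": ["east-01", "east-02", "east-03"],
    "west": ["west-01", "west-02"],
    "central": ["central-01", "central-02"],
}
replicas = place_replicas("video/clip-0001.mpg", cluster, copies=2)
```

Because each copy lands at a different site, losing a node, or even a whole site, still leaves a readable copy elsewhere, which is the point of putting redundancy at the node level.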
To lower purchase, power and cooling costs, nodes are optimized with only the features required for the application. Less expensive than blades, cluster nodes are denser and lack redundant power supplies and fans. Redundancy is at the node level, and the clustering software handles node failures transparently, providing both resiliency and flexibility. Such clusters are more or less self-managing and scale up quickly.
Depending on hardware configuration and the software you install, clusters can be compute-intensive for HPC tasks or more storage-oriented, providing the equivalent of a huge NFS cloud with a single name space.
Companies like Google and Yahoo built and still build their own custom infrastructure. Google orders huge quantities of custom motherboards directly from Intel to fit its low cost and power consumption requirements. (If Google were a system manufacturer, it would be in the top five.) However, you don't have to build your own custom Web 2.0 storage infrastructure. Increasingly, mainstream storage companies are developing products and services to do this for you.
Design to Order
Dell (NASDAQ: DELL) was one of the first companies to provide Web 2.0 infrastructure. Its Data Center Solutions Division announced Cloud Computing Solutions in March 2007. Through this program, Dell designs, provides, and even installs racks of servers and storage for clustered service or storage delivery, optimized for your application (and low power consumption). There are even maintenance and rental options.
According to discussions on Dell's In The Clouds blog, this service is for large orders (1500+ nodes) and you must provide your own clustering software. Dell is not providing the off-the-shelf systems that it sells to the public, but has developed systems designed specifically for clustering applications.
Sun Microsystems (NASDAQ: JAVA) and Rackable Systems (NASDAQ: RACK) are also in the Web 2.0 business. In addition to offering racks of compute and storage nodes suitable for clustering, both companies are notable for offering mobile data centers packaged in shipping containers. Sun's Modular Datacenter S20, for example, sits in a 20-foot-long shipping container that needs only single power, network, and water hookups.
Water cooling allows these units to be denser and more power-efficient than a similar number of nodes in a typical air-cooled data center. The main attraction is getting massive amounts of storage or computing power going in a short time. Again, you must provide the clustering software to tie it all together, although Sun last year acquired the Lustre clustered file system and is bringing it into its Open Storage project.
Space and power consumption have become big data storage issues, particularly for Web-scale data centers. IBM's (NYSE: IBM) April introduction of the iDataPlex Web 2.0 server system directly addresses these concerns. By rotating a standard 42U rack 90 degrees about its vertical axis (landscape instead of portrait) and fitting in two side-by-side stacks of half-depth nodes (15 inches front-to-back), IBM can shoehorn up to 84 CPU nodes into the space usually occupied by 42, with 16U of lateral space left over for switching hardware. For storage applications, there are 3U units that supply one CPU and 12TB of hard drive storage, for a maximum of 336TB per rack with 28 nodes.
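The rack figures above follow from simple arithmetic, sketched below. The constants come from the numbers quoted in this article; the assumption that each compute node occupies 1U in one of the two stacks is inferred from the 84-node total rather than stated by IBM, and the names are illustrative.

```python
# Figures quoted in the article; names are illustrative only.
RACK_UNITS = 42             # height of a standard rack, in U
STACKS_PER_RACK = 2         # side-by-side stacks of half-depth nodes
COMPUTE_NODE_HEIGHT_U = 1   # assumption: one compute node per U per stack
STORAGE_UNIT_HEIGHT_U = 3   # height of one storage unit
TB_PER_STORAGE_UNIT = 12    # hard drive capacity per storage unit


def total_node_slots_u() -> int:
    """Total U of node space across both stacks."""
    return RACK_UNITS * STACKS_PER_RACK


def max_compute_nodes() -> int:
    """Most compute nodes that fit in one rack."""
    return total_node_slots_u() // COMPUTE_NODE_HEIGHT_U


def max_storage_units() -> int:
    """Most 3U storage units that fit in one rack."""
    return total_node_slots_u() // STORAGE_UNIT_HEIGHT_U


def max_rack_capacity_tb() -> int:
    """Largest storage capacity of one rack, in TB."""
    return max_storage_units() * TB_PER_STORAGE_UNIT


print(max_compute_nodes(), max_storage_units(), max_rack_capacity_tb())
```

Run as written, this reproduces the 84 compute nodes, 28 storage units and 336TB per rack cited above.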
The sideways twist is even more important for reducing power consumption. The distance the fan units must push air to cool the nodes is half what it normally is, and since the relation between cooling distance and fan power is non-linear, the drop in power required is much more than half. More efficiency comes from using fewer, larger fans. Pluggable four-fan units cool eight nodes. According to Gregg McKnight, distinguished engineer and vice president at IBM Modular Systems Development, the fans consume approximately 6 watts per server. For data centers with maxed-out air conditioning systems, iDataPlex can take an optional water-cooled heat exchanger that provides a net cooling effect.
According to McKnight, "Companies buying lots of nodes want them just the way they want them."
Though its systems are not as customizable as Dell's cluster systems, IBM provides 22 different node variations (processor, I/O slots, memory and storage) with several power supply options to better match power to application need. IBM can supply either Linux or Windows to run the Intel-based nodes, and also provides the clustering capability with the Nextra software it acquired when it bought XIV.
As a result, IBM can provide "a compute cluster optimized for space," said McKnight. "The entire solution is pre-built, cabled and tested, allowing the customer to bring it up in minutes."
HP recently introduced a clustered system purely for storage, the HP StorageWorks 9100 Extreme Data Storage System (ExDS9100). The ExDS9100 combines HP C-class blades running Linux, several 82-drive storage blocks, the PolyServe clustering file system, and management software that "treats the ExDS9100 as a single big blade," said Ian Duncan, director of NAS marketing for HP.
"It's scalable NAS for less than $2 a gigabyte," said Duncan. "File-based storage is where the growth is for 90 percent of the companies HP is talking to."
ExDS9100 is dense (12TB per U) and extremely easy to scale. It uses blades for compute units, but you don't need a lot of blades for the amount of storage supported because the drives aren't directly attached to the blades. The unit takes from one to four four-blade performance blocks, and up to 10 82TB RAID 6 storage blocks (ranging from 246TB to 820TB in capacity).
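The capacity range quoted above is simple multiplication. The sketch below assumes 82TB per storage block, the figure implied by the 246TB and 820TB endpoints, and a three-block minimum configuration, which is also inferred from that range rather than stated by HP. The function and constant names are illustrative.

```python
# Assumptions inferred from the article's figures, not from HP documentation.
TB_PER_STORAGE_BLOCK = 82     # implied by 820TB across 10 blocks
MIN_STORAGE_BLOCKS = 3        # implied by the 246TB low end
MAX_STORAGE_BLOCKS = 10
BLADES_PER_PERFORMANCE_BLOCK = 4
MAX_PERFORMANCE_BLOCKS = 4


def capacity_tb(storage_blocks: int) -> int:
    """Raw capacity in TB for a given number of storage blocks."""
    if not MIN_STORAGE_BLOCKS <= storage_blocks <= MAX_STORAGE_BLOCKS:
        raise ValueError("storage block count outside the assumed range")
    return storage_blocks * TB_PER_STORAGE_BLOCK


def blade_count(performance_blocks: int) -> int:
    """Number of blades for a given number of performance blocks."""
    if not 1 <= performance_blocks <= MAX_PERFORMANCE_BLOCKS:
        raise ValueError("performance block count must be between 1 and 4")
    return performance_blocks * BLADES_PER_PERFORMANCE_BLOCK


print(capacity_tb(MIN_STORAGE_BLOCKS), capacity_tb(MAX_STORAGE_BLOCKS))
print(blade_count(1), blade_count(MAX_PERFORMANCE_BLOCKS))
```

Keeping the two functions separate mirrors the design: capacity grows with storage blocks and processing power with performance blocks, so either can be scaled without touching the other.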
You can scale both capacity and, for CPU-intensive storage applications like on-demand video, performance. "A newly plugged-in performance block is detected and initialized in literally seconds," asserted Duncan. ExDS9100 provides access to other systems via both NFS and HTTP protocols, and multiple storage systems can be linked together with PolyServe.
Duncan sees three types of customers needing Web 2.0 storage infrastructure. The first is pure Web 2.0 companies with a business model that delivers services or content over the Web. The second is traditional enterprises dealing with pockets of content explosion in their own corporate data; life science companies sequencing genomes, for example, may produce hundreds of TB in a week. The third is traditional enterprises wanting to offer SaaS. A good example is HP's own Snapfish online photo storage service, which has served as a proving ground for ExDS9100.
Still To Come
EMC (NYSE: EMC) has made announcements in the Web 2.0 storage area, and although details are yet to come, EMC's stature in the storage industry gets attention. In addition to the Fortress SaaS storage platform announced in January as the infrastructure behind its Mozy backup services (but not as a product itself), EMC has also been discussing since last year two products with code names "Hulk" and "Maui." Hulk may be a clustered NAS hardware system, and Maui is purported to be clustered file system software of a "global" scale. But users will have to wait for details as EMC's strategy evolves.