Gluster Brings Open Source to Unstructured Data
Gluster is an open source startup that most people in storage have never heard of. Yet its value proposition could spell trouble for the big boys and potentially send the prices of proprietary hardware crashing down.
Its flagship product, the Gluster Storage Platform, is aimed primarily at unstructured data. It combines open source software with commodity server hardware to deliver a low-cost platform that holds a whole lot of data. We are talking multiple petabytes, with the ability to scale up or down at will to meet capacity and performance requirements. It achieves this by pooling and virtualizing storage resources under a unified global namespace managed as a single entity.
The most growth right now is in unstructured data, said Ben Golub, president and CEO of Gluster. This, he said, is driven by media, scientific and Web 2.0 applications.
Golub is the former CEO of Plaxo and recently joined Gluster because he saw the company as being uniquely positioned at the intersection of four key industry trends – open source, the need for scalable storage, the move to virtualized data centers, and cloud computing.
“We are seeing 80% growth annually in unstructured data,” said Golub. He believes new technologies on the immediate horizon like 3D seismic and the smart grid will add to that growth.
Golub said virtualization and the overall explosion of unstructured data have broken proprietary architectures. Traditional storage systems designed to deal with structured information like databases are expensive, don’t scale well and are focused on transactions. They weren’t designed to cope with millions of users, and petabytes of photos and videos.
Gluster’s CTO and co-founder, AB Periasamy, said many vendors have attempted to solve this problem. Over the past few years, he said, a large number of file systems have been pushed by various vendors and bodies: ZFS, Lustre, GFS, Spinnaker and others.
“Their efforts to create scalable file systems for NAS and unstructured data have resulted in complexity,” said Periasamy. “None of these systems achieved the desired level of scalability.”
He gave the example of NetApp Data ONTAP, which has a 16TB limit for a single volume. For data sets beyond that size, you must create multiple 16TB volumes. Other approaches bottleneck on either a centralized or distributed metadata server (MDS) model. As the size of the data set escalates, the MDS struggles to search millions of entries to find the associated metadata. Eventually it can’t keep up, and in shared NAS environments with multiple applications, that can lead to corruption and reliability issues.
Enter Gluster. Rather than building bigger arrays containing bigger or faster drives, Gluster basically clusters a large number of cheap drives and presents them as one large virtual pool of storage.
“We virtualize storage resources under a unified global namespace that is managed as a single entity,” said Periasamy. “The resources can then be easily allocated to multiple users or groups.”
Periasamy said Gluster has resolved the scalability challenge by eliminating the need for the metadata server (MDS). Gluster’s algorithms circumvent the need for an MDS by allowing the Gluster Storage Platform to automatically know where data resides.
“This removes the MDS as a bottleneck and as a single point of failure,” said Periasamy. “We were able to solve this problem correctly as we had a huge open source community around us.”
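How a storage system can know where data resides without asking a metadata server is easiest to see with a small sketch. The article does not describe Gluster's actual algorithm, so the following is only an illustration of one common approach, hash-based placement, under stated assumptions: the function name `locate_server` and the server names are hypothetical, and the simple modulo scheme is chosen for brevity.

```python
import hashlib


def locate_server(path: str, servers: list[str]) -> str:
    """Pick the server holding `path` by hashing the path itself.

    Illustrative assumption only: this is NOT Gluster's algorithm, just a
    generic example of computing a location instead of looking it up.
    """
    if not servers:
        raise ValueError("at least one server is required")
    # Hash the file path; every client computes the same digest.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest onto the list of servers.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    # Hypothetical server names, used only for this example.
    pool = ["server1", "server2", "server3", "server4"]
    for name in ["/photos/a.jpg", "/videos/b.mp4", "/logs/c.txt"]:
        print(name, "->", locate_server(name, pool))
```

Because every client derives the location from the path alone, no central lookup table has to be queried or kept consistent. A plain modulo mapping like this one relocates most files whenever the server list changes, so a production system would need a more refined placement scheme.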
He said Gluster’s technology can be used by any organization deploying two or more servers with at least two or three terabytes of data. A free distribution of the software is available at www.gluster.org. But the company has a similar model to Red Hat with Linux in that it offers commercial support for the platform at Gluster. In other words, the software is free and you pay for tech support and other subscription services.
In addition, Gluster is about to release value-added modules. The modules will include better monitoring tools as well as integration with MapReduce analytics.
The Gluster Community is also adding value in terms of real-world deployments. Anything they develop for the platform goes into the open source pot.
“Our users are finding cases where we don’t perform well and are adding fixes and other features ahead of our roadmap,” said Periasamy. “Examples include tools to interface with storage running on Amazon and Rackspace, and a chargeback module.”
Gluster in Action
How is Gluster being harnessed in the real world? The ideal customer is one with a large number of files that needs to scale out. Those with huge volumes of large and small files, those using the cloud, and users with many virtual machines (VMs) are also great candidates, said Periasamy.
He noted another application used by those in real-time trading environments. In this use case, Gluster is paired with multiple solid-state drives (SSDs) to create a virtual volume that appears as one large system across the network. On Gigabit Ethernet (GbE), for instance, that translates to 110MBps per Gigabit pipe, while a 10GbE interconnect means 1GBps per connection.
“With InfiniBand you get up to 16GBps in throughput,” said Periasamy. “Gluster takes the drive bottleneck away as no single drive has to keep up, so that now the network can sometimes become the bottleneck.”
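The shift of the bottleneck from drives to network can be checked with simple arithmetic. The sketch below uses the link figures quoted above (110MBps for GbE, roughly 1,000MBps for 10GbE); the per-drive throughput of 80MBps and the function name `effective_throughput` are assumptions made for illustration, not figures from Gluster.

```python
def effective_throughput(drive_count: int, per_drive_mbps: float,
                         link_mbps: float) -> tuple[float, str]:
    """Return (deliverable MBps, limiting component) for a striped pool.

    Simplified model: drives add up linearly and a single network link
    caps whatever the drives can deliver in aggregate.
    """
    aggregate = drive_count * per_drive_mbps
    effective = min(aggregate, link_mbps)
    bottleneck = "network" if aggregate > link_mbps else "drives"
    return effective, bottleneck


if __name__ == "__main__":
    GBE_MBPS = 110.0        # per Gigabit pipe, as quoted in the article
    TEN_GBE_MBPS = 1000.0   # roughly 1GBps per 10GbE connection
    PER_DRIVE_MBPS = 80.0   # assumed figure, for illustration only

    for drives in (1, 4, 16):
        for label, link in (("GbE", GBE_MBPS), ("10GbE", TEN_GBE_MBPS)):
            mbps, limit = effective_throughput(drives, PER_DRIVE_MBPS, link)
            print(f"{drives:>2} drives over {label}: {mbps:.0f} MBps, limited by {limit}")
```

Under these assumptions a single drive is the limit on either link, while a handful of drives already saturates a Gigabit pipe, which matches the point that the network rather than any single drive becomes the constraint.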
Another possible use case is to take away the need for sophisticated tiering. As other architectures don’t scale well, they try to make up for it by adding tiers of storage, said Periasamy. For most applications, Gluster provides faster performance with clusters of SATA drives than can be achieved on most networks with expensive drives, such as SSDs.
“Systems like those provided by EMC are great for database systems and processing of real-time transactions that tend to be lower capacity and have high IO random activity,” said Periasamy. “But such tools struggle with large amounts of unstructured data and that’s our focus.”
He concedes that Gluster is not a good match for certain use cases. Online Transaction Processing (OLTP), Oracle databases, Microsoft Exchange and SQL Server are a few examples.
He added that some customers have deployed Gluster on so-called green drives – those that spin down when idle – with mixed results. His conclusion is that such drives should not be used in primary storage. If used in a nearline capacity, however, the drives can be used successfully.