In just a few years, data deduplication has gone from a technology with a lot of promise that only very large enterprises could afford to one that is nearly ubiquitous for making the most of backup and recovery.
Dedupe has become so essential that data storage vendors have been shelling out millions — even billions — to acquire deduplication technology, as was the case last summer, when EMC (NYSE: EMC) acquired Data Domain for $2 billion.
Now we’re seeing the next evolution in data deduplication technology, open source deduplication, with several established open source storage vendors (Bacula, Nexenta, Sun/Oracle and Zmanda), as well as some new players like Opendedup, challenging proprietary solutions and literally giving away the technology.
Open Source Deduplication Solutions
In March, Opendedup, a new open source deduplication solution, made headlines when it debuted. A deduplication file system for Linux also known as SDFS, Opendedup was designed for enterprises with virtual environments looking for a high-performance, scalable, low-cost deduplication solution.
According to developer Sam Silverberg, “the design goal of SDFS was to leverage the performance and scalability benefits provided by object-based file systems with the storage optimization available with deduplication.” The results: Opendedup/SDFS can dedupe a petabyte or more of data; supports over 3 TB per gigabyte of memory at a 128K chunk size; performs inline deduplication at a speed of 290 MB/s; has high aggregate I/O performance; supports VMware (and Xen and KVM) and can dedupe at 4K block sizes. And did we mention it’s free?
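To put the memory figure in perspective, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes binary units (1 TB = 2^40 bytes) and that memory use grows linearly with the number of chunks tracked. Neither assumption comes from Opendedup's documentation, and the function names are made up for illustration.

```python
# Rough arithmetic behind "3 TB per gigabyte of memory at a 128K chunk size".
# Assumptions (not from Opendedup): binary units, and RAM use that scales
# linearly with the number of chunks the file system has to track.
KB, GB, TB = 2**10, 2**30, 2**40


def chunks_tracked(capacity_bytes, chunk_size_bytes):
    """Number of chunks needed to cover the given capacity."""
    return capacity_bytes // chunk_size_bytes


def ram_bytes_per_chunk(capacity_per_gb_ram, chunk_size_bytes):
    """Implied memory budget per chunk if 1 GB of RAM covers the capacity."""
    return GB / chunks_tracked(capacity_per_gb_ram, chunk_size_bytes)


def capacity_per_gb_ram(ram_per_chunk, chunk_size_bytes):
    """Capacity that 1 GB of RAM could cover at another chunk size."""
    return (GB / ram_per_chunk) * chunk_size_bytes


n_chunks = chunks_tracked(3 * TB, 128 * KB)        # chunks in 3 TB at 128K
per_chunk = ram_bytes_per_chunk(3 * TB, 128 * KB)  # roughly 43 bytes each
cap_4k = capacity_per_gb_ram(per_chunk, 4 * KB)    # same budget at 4K chunks

print(f"{n_chunks:,} chunks, {per_chunk:.1f} bytes of RAM per chunk")
print(f"At 4K chunks, 1 GB of RAM would cover about {cap_4k / GB:.0f} GB")
```

Under those assumptions, the 4K block size that suits virtual machine images needs 32 times as many index entries as the 128K size for the same amount of data, which is why chunk size is a trade-off between dedupe granularity and memory.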
Opendedup/SDFS also only takes about 20 minutes to set up on a standard Linux system, said Silverberg, and no compiling is necessary. “SDFS volumes are mounted and created like any Linux file system, and the commands should be familiar to anyone who has ever mounted a volume on a Linux system,” he said. Moreover, for those who need a little help, there is a quick start guide and a detailed administration guide on the Opendedup Web site. But can anyone (anyone, that is, with a Linux system) use and benefit from Opendedup?
According to Silverberg, any organization that heavily leverages virtualization (“SDFS can deduplicate hundreds of virtual machines across shared or distinct SDFS volumes … and can spin up new VMs and clone existing ones very quickly”), or is looking for a storage-efficient, disk-based backup system (“an SDFS volume can be presented for disk-based backup and provides storage savings and I/O benefits”), or needs to archive lots of data (“SDFS volumes can be presented as NAS shares … and unstructured data can be copied and archived to SDFS volumes as third-tier storage”) will be able to benefit from Opendedup/SDFS.
But is Opendedup/SDFS truly an alternative to proprietary solutions?
“SDFS has performance, scalability and cost advantages over many proprietary solutions, but I think proprietary solutions have some real technical benefits,” said Silverberg. “Replication, source-based deduplication, and 24/7 phone support are not available today in open source solutions.”
SDFS is a file system, “which makes it easy to implement as a storage device,” but it’s also “harder to get deep integration into solutions such as backup and hypervisors without hooks into proprietary APIs,” said Silverberg.
However, he added, “if an organization is looking for raw performance, scalability and deduplication from a file system, SDFS is the way to go.” And clearly many enterprises are: in its first week alone, Opendedup.org had over 14,000 unique visitors, many of whom downloaded the software.
Open source network backup and restore software vendor Bacula Systems is also climbing on the open source deduplication bandwagon.
“In most enterprises, the total amount of storage in use is increasing at a very rapid rate, something like 40 percent per year,” said Kern Sibbald, founder of Bacula.org and the CTO of Bacula Systems. “So to keep up with this increasing volume of storage to be backed up, we needed to make our backup programs faster and more efficient.” And one way to do that is by introducing deduplication.
“Within Bacula [version 5.0.0], we have implemented something that we call Base jobs, which allow the user to control which files will be considered for deduplication,” he said. “This is our first step into deduplication, and it is a file-level deduplication rather than a block-level deduplication.”
Sibbald noted that some storage analysts refer to Bacula’s deduplication solution as SIS, or Single Instance Storage, but that Bacula refers to it as file-level deduplication.
“The advantage of what we have done is that it is relatively simple to implement compared to other deduplication techniques, and it does deduplication on tape and disk equally well and very efficiently,” he said. “In addition, there is very little extra overhead during restore, contrary to some of the block- or byte-level deduplication techniques being used.”
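For readers unfamiliar with the idea, file-level deduplication (single instance storage) can be sketched in a few lines. This is a generic illustration, not Bacula's Base jobs implementation: the function names, the in-memory dictionaries and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib


def file_level_dedupe(files):
    """Store each distinct file body once.

    `files` maps a path to its content as bytes. Returns (store, catalog):
    `store` maps a content hash to the bytes, `catalog` maps each path to
    the hash of its content. Both structures are hypothetical.
    """
    store, catalog = {}, {}
    for path, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        store.setdefault(digest, content)  # keep only the first copy
        catalog[path] = digest
    return store, catalog


def restore(store, catalog, path):
    """Restoring a file is a single lookup, with no block reassembly."""
    return store[catalog[path]]


backup = {
    "server-a/etc/hosts": b"127.0.0.1 localhost\n",
    "server-b/etc/hosts": b"127.0.0.1 localhost\n",  # identical file
    "server-b/var/notes.txt": b"quarterly numbers\n",
}
store, catalog = file_level_dedupe(backup)
print(f"{len(backup)} files backed up, {len(store)} bodies stored")
```

The sketch also hints at why restore overhead stays low with this approach: each file comes back from one lookup. The flip side is that a file changed by a single byte is stored again in full.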
That said, Sibbald admitted that Bacula’s been experimenting with both block and sliding block deduplication techniques, and that one or both may very well be included in a future release of the software.
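The difference between the two block approaches shows up when data shifts. The sketch below compares fixed-size blocks with content-defined chunking, one common way of making block boundaries follow the data. It is an illustration only: Bacula has not said which technique it is testing, and the window size, divisor and use of CRC32 are arbitrary assumptions.

```python
import hashlib
import random
import zlib


def fixed_blocks(data, size=64):
    """Cut the data every `size` bytes, regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, divisor=64):
    """Cut wherever a checksum of the last `window` bytes hits a target.

    Boundaries depend on nearby content rather than on absolute offsets.
    The checksum is recomputed at every position to keep the code short;
    a real implementation would use a rolling hash.
    """
    chunks, start = [], 0
    for end in range(window, len(data) + 1):
        if zlib.crc32(data[end - window:end]) % divisor == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_chunks(a, b):
    """Count distinct chunk hashes that two chunk lists have in common."""
    def hashes(chunks):
        return {hashlib.sha256(c).digest() for c in chunks}
    return len(hashes(a) & hashes(b))


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(20000))
shifted = b"X" + original  # one byte inserted at the front

fixed_shared = shared_chunks(fixed_blocks(original), fixed_blocks(shifted))
cdc_shared = shared_chunks(content_defined_chunks(original),
                           content_defined_chunks(shifted))
print("fixed-size blocks still shared:", fixed_shared)
print("content-defined chunks still shared:", cdc_shared)
```

Inserting a single byte moves every fixed-size block boundary, so almost nothing matches the earlier backup, while the content-defined chunks after the first boundary are unchanged and dedupe as before.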
As for Bacula 5.0.0, the response has been impressive, said Sibbald. “It was by far the release with the most downloads within a few days of the initial release,” he said, though he couldn’t say how much of that was attributable to the inclusion of data deduplication, as the release included other new features.
Zmanda, which is based on Amanda open source backup and recovery software, has likewise begun to include deduplication in its software.
“We are pursuing both source-level [on the backup client] and target-level [on the storage media] deduplication,” said Chander Kant, the CEO of Zmanda, who noted that Amanda has already been tested and certified with several target-level deduplication technologies, including EMC’s Data Domain and Oracle/Sun ZFS.
“Deduplication potentially saves a lot of system resources for Zmanda customers,” he said. “And we are seeing very good compression ratios.” Moreover, the deduplication is transparent to end users.
As with Opendedup and Bacula, the response to the inclusion of open source deduplication on the target side in Amanda has been positive, said Kant — and he sees more businesses, especially small and medium-sized companies, jumping on open source deduplication solutions “that can stretch their limited IT budgets by saving on storage costs.”
As for open source storage solution vendor Nexenta Systems, it incorporated ZFS-based inline deduplication in the latest version of its storage solution, NexentaStor 3.0, which was released at the end of March. And Nexenta claims that not only is NexentaStor 3.0 the first storage solution to offer inline deduplication for primary storage, but that open source solutions like ZFS are technically superior to proprietary ones.
“We were extremely impressed with ZFS inline deduplication — and convinced that it is the best deduplication technology available on the market,” said Evan Powell, the CEO of Nexenta Systems.
Indeed, when asked to compare how NexentaStor stacked up against the competition, Nexenta claimed that customers that used NexentaStor typically experienced a 75 percent cost savings versus proprietary solutions, in large part because of the increased efficiency through compression.
As for NexentaStor’s target market, that would be enterprises with large virtual environments such as Microsoft Hyper-V, Citrix Xen and VMware, including hosting and cloud service providers, research and development organizations and businesses with virtual desktop environments.
Standards Favor Open Source
So putting the hype aside for a minute, are open source deduplication solutions really as good or as reliable and scalable as proprietary solutions?
“Proprietary solutions are expensive and the source code is not available, so it is not easy to check or compare their performance,” said Bacula’s Sibbald. “From the deduplication statistics that I have seen from proprietary vendors and those given by open source projects such as lessfs, I would say that the open source solutions stack up very well against the proprietary solutions.”
Added Zmanda’s Kant: “Over time, deduplication will become standard. Just like we have standard algorithms for compression today, there will be standard algorithms and formats for deduplication. And open source shines with standardization. So the future of deduplication is squarely with open source.”