Open Source Deduplication: Ready for Enterprises?


In just a few years, data deduplication has gone from a promising technology that only very large enterprises could afford to a nearly ubiquitous tool for making the most of backup and recovery.

Dedupe has become so essential that data storage vendors have been shelling out millions — even billions — to acquire deduplication technology, as was the case last summer, when EMC (NYSE: EMC) acquired Data Domain for $2 billion.

Now we're seeing the next evolution in data deduplication technology, open source deduplication, with several established open source storage vendors (Bacula, Nexenta, Sun/Oracle and Zmanda), as well as some new players like Opendedup, challenging proprietary solutions and literally giving away the technology.


In March, Opendedup, a new open source deduplication solution, made headlines when it debuted. A deduplication file system for Linux also known as SDFS, Opendedup was designed for enterprises with virtual environments looking for a high-performance, scalable, low-cost deduplication solution.

According to developer Sam Silverberg, "the design goal of SDFS was to leverage the performance and scalability benefits provided by object-based file systems with the storage optimization available with deduplication." The results: Opendedup/SDFS can dedupe a petabyte or more of data; supports over 3 TB per gigabyte of memory at a 128K chunk size; performs inline deduplication at a speed of 290 MB/s; has high aggregate I/O performance; supports VMware (and Xen and KVM) and can dedupe at 4K block sizes. And did we mention it's free?
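To put the memory figure in perspective, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch: it assumes binary units (1 TB = 1024^4 bytes) and treats "3 TB per gigabyte of memory" as an exact ratio. The 4K extrapolation further assumes that the per-chunk memory cost stays constant, which the article does not state.

```python
# Back-of-the-envelope index arithmetic for the quoted SDFS figures.
# Assumptions (not from the article): binary units, and a constant
# per-chunk memory cost when the chunk size changes.
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def chunks_in(capacity_bytes, chunk_bytes):
    """Number of fixed-size chunks needed to cover a capacity."""
    return capacity_bytes // chunk_bytes


# "over 3 TB per gigabyte of memory at a 128K chunk size"
chunks_128k = chunks_in(3 * TIB, 128 * KIB)

# Memory available per chunk if 1 GiB of RAM tracks all of those chunks.
bytes_per_chunk = GIB / chunks_128k

# Hypothetical extrapolation: same per-chunk cost at a 4K chunk size.
capacity_4k = chunks_128k * 4 * KIB

print(f"chunks tracked per GiB of RAM: {chunks_128k:,}")
print(f"memory budget per chunk: {bytes_per_chunk:.1f} bytes")
print(f"capacity per GiB of RAM at 4K chunks: {capacity_4k // GIB} GiB")
```

Under these assumptions, each gigabyte of memory tracks roughly 25 million chunks at about 43 bytes apiece, and dropping to 4K chunks would cut the capacity covered per gigabyte of memory by a factor of 32. That trade-off is why chunk size matters so much when sizing a deduplicating volume.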

Opendedup/SDFS also takes only about 20 minutes to set up on a standard Linux system, said Silverberg, and no compiling is necessary. "SDFS volumes are mounted and created like any Linux file system, and the commands should be familiar to anyone who has ever mounted a volume on a Linux system," he said. Moreover, for those who need a little help, there is a quick start guide and a detailed administration guide on the Opendedup Web site. But can anyone (anyone, that is, with a Linux system) use and benefit from Opendedup?

According to Silverberg, any organization that heavily leverages virtualization ("SDFS can deduplicate hundreds of virtual machines across shared or distinct SDFS volumes ... and can spin up new VMs and clone existing ones very quickly"), or is looking for a storage-efficient, disk-based backup system ("an SDFS volume can be presented for disk-based backup and provides storage savings and I/O benefits"), or needs to archive lots of data ("SDFS volumes can be presented as NAS shares ... and unstructured data can be copied and archived to SDFS volumes as third-tier storage") will be able to benefit from Opendedup/SDFS.

But is Opendedup/SDFS truly an alternative to proprietary solutions?

"SDFS has performance, scalability and cost advantages over many proprietary solutions, but I think proprietary solutions have some real technical benefits," said Silverberg. "Replication, source-based deduplication, and 24/7 phone support are not available today in open source solutions."

SDFS is a file system, "which makes it easy to implement as a storage device," but it's also "harder to get deep integration into solutions such as backup and hypervisors without hooks into proprietary APIs," said Silverberg.

However, he added, "if an organization is looking for raw performance, scalability and deduplication from a file system, SDFS is the way to go." And clearly many enterprises are: in its first week alone, Opendedup.org had more than 14,000 unique visitors, many of whom downloaded the software.


Open source network backup and restore software vendor Bacula Systems is also climbing on the open source deduplication bandwagon.

"In most enterprises, the total amount of storage in use is increasing at a very rapid rate, something like 40 percent per year," said Kern Sibbald, founder of Bacula.org and the CTO of Bacula Systems. "So to keep up with this increasing volume of storage to be backed up, we needed to make our backup programs faster and more efficient." And one way to do that is by introducing deduplication.

"Within Bacula [version 5.0.0], we have implemented something that we call Base jobs, which allow the user to control which files will be considered for deduplication," he said. "This is our first step into deduplication, and it is a file-level deduplication rather than a block-level deduplication."

Sibbald noted that some storage analysts refer to Bacula's deduplication solution as SIS, or Single Instance Storage, but that Bacula refers to it as file-level deduplication.
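For readers unfamiliar with the idea, file-level deduplication (single instance storage) can be sketched in a few lines of Python. This is a generic illustration, not Bacula's Base jobs implementation: the `SingleInstanceStore` class, its method names and the sample paths are all hypothetical, and it identifies duplicates by hashing whole files with SHA-256.

```python
import hashlib


class SingleInstanceStore:
    """Toy file-level dedup store: identical files are kept only once."""

    def __init__(self):
        self.blobs = {}    # content digest -> file content
        self.catalog = {}  # backup path -> content digest

    def add(self, path, content):
        """Record a file; return True only if its content was new."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = content
        self.catalog[path] = digest
        return is_new

    def restore(self, path):
        """Restore is a single lookup; no block reassembly is needed."""
        return self.blobs[self.catalog[path]]

    def stored_bytes(self):
        return sum(len(blob) for blob in self.blobs.values())


store = SingleInstanceStore()
library = b"\x7fELF" + b"\x00" * 1000  # stand-in for a shared system file
first = store.add("host-a/lib/libexample.so", library)
second = store.add("host-b/lib/libexample.so", library)  # duplicate
third = store.add("host-b/etc/motd", b"welcome to host b\n")
print(first, second, third, store.stored_bytes())
```

The second copy of the shared file costs only a catalog entry. The limitation is equally visible: change a single byte in a file and the whole file is stored again, which is the gap that block-level techniques aim to close.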

"The advantage of what we have done is that it is relatively simple to implement compared to other duplication techniques, and it does deduplication on tape and disk equally well and very efficiently," he said. "In addition, there is very little extra overhead during restore, contrary to some of the block- or byte-level deduplication techniques being used."

That said, Sibbald admitted that Bacula's been experimenting with both block and sliding block deduplication techniques, and that one or both may very well be included in a future release of the software.
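The difference between the two block approaches shows up when data shifts. The Python sketch below compares fixed-size blocks with content-defined chunking, one common way to get the shift tolerance that sliding-block schemes aim for. It is an illustration under stated assumptions, not what Bacula was testing: the function names, the 4 KB block size, the 16-byte window and the simple rolling-sum boundary test are all hypothetical choices made for brevity.

```python
import random


def fixed_chunks(data, size=4096):
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, divisor=256):
    """Cut wherever a rolling sum over the last `window` bytes hits a
    target value, so boundaries depend on content rather than offset."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        end = i + 1
        if end >= window and rolling % divisor == divisor - 1:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def new_chunks(old, new, chunker):
    """Count chunks of `new` that are not already stored from `old`."""
    known = set(chunker(old))
    return sum(1 for chunk in chunker(new) if chunk not in known)


rng = random.Random(0)  # seeded so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
shifted = b"X" + original  # one byte inserted at the front

fixed_new = new_chunks(original, shifted, fixed_chunks)
cdc_new = new_chunks(original, shifted, content_defined_chunks)
print("new fixed-size blocks to store:", fixed_new)
print("new content-defined chunks to store:", cdc_new)
```

With fixed-size blocks, a one-byte insertion moves every block boundary, so essentially nothing matches what is already stored. With content-defined boundaries, at most two chunks near the insertion point are new and the rest deduplicate as before. A production system would use a stronger rolling hash and enforce minimum and maximum chunk sizes.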

As for Bacula 5.0.0, the response has been impressive, said Sibbald. "It was by far the release with the most downloads within a few days of the initial release," he said, though he couldn't say how much of that was attributable to the inclusion of data deduplication, as the release included other new features.

