De-Mystifying De-Duplication


Even as the price of storage continues to come down, the amount of data enterprises need to store and back up continues to go up. But what if there were an enterprise solution that could cut backup storage requirements by a factor of 10 or even 20, making disk-based backup even more affordable? Therein lies the promise of de-duplication.

As Curtis Preston, vice president of data protection at GlassHouse and a leading expert on backup and storage, sees it: “Everybody with a significant amount of data should be at least examining de-duplication.”

But when it comes to actually buying a de-duplication solution, Preston isn’t ready to make a recommendation. With a lot of players — many of whom are just making their de-duplication solutions generally available — offering a lot of different approaches to de-duplication, what might be a great solution for one company may not be the best choice for another, he says.

So here is a guide to help you cut through some of the clutter.

The Basics

While de-duplication is an attractive proposition for organizations of every size, the equipment and costs involved mean it really only makes financial sense for enterprises with at least a couple of terabytes to back up.

Also, prospective customers need to understand that “like compression, your mileage may vary,” explains Preston.

“Generally speaking, 10 to 1 is a pretty safe estimate,” he says, adding that “there are many people who use 20 to 1. But not everybody is going to get that. What you actually end up getting is going to depend highly on how you do backups and what kind of data you’re backing up.”
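Why the ratio "depends highly on how you do backups" becomes clearer with a back-of-the-envelope model. The sketch below (illustrative figures of my own, not numbers from Preston) compares the logical data written by a traditional weekly-full-plus-daily-incremental schedule against the unique data a de-duplicating system actually has to keep:

```python
def dedupe_ratio(full_tb: float, weeks: int, daily_change: float) -> float:
    """Rough model of a weekly-full, daily-incremental backup schedule.

    Returns logical data written divided by unique data stored.
    Assumes each incremental is daily_change * full_tb and that
    changed blocks are the only new unique data each day.
    """
    # Each week: one full backup plus six incrementals.
    logical = weeks * (full_tb + 6 * daily_change * full_tb)
    # Unique data: the baseline full plus the blocks changed each day.
    unique = full_tb + weeks * 7 * daily_change * full_tb
    return logical / unique

# 10 TB of primary data, 16 weeks of retention, 0.5% changing daily:
print(round(dedupe_ratio(10, 16, 0.005), 1))  # -> 10.6
```

Longer retention and lower change rates push the ratio up toward 20 to 1; highly volatile or pre-compressed data pushes it down, which is exactly why mileage varies.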

As for how and what, that depends largely on whether you are an organization with a lot of remote sites and/or mobile employees or whether you are looking at deduping at a centralized data center. You can find de-duplication solutions for both approaches, but one solution does not fit all.

Targets, Replacements Square Off

Just as with optical disk storage, where you have Blu-ray and HD DVD duking it out for industry domination, the de-duplication battlefield is divided into two main camps. On the one side you have “the targets,” as Preston describes them. These are generally the makers of virtual tape libraries (VTLs), whose de-duplication solutions act as an adjunct to an enterprise’s existing backup software. Leading players in this space include Diligent, Data Domain, Sepaton, FalconStor and Quantum. Across the battlefield are “the replacements,” vendors like Avamar (now part of EMC), Asigra, Symantec, Atempo and TimeSpring, which promise to dedupe data at the source, essentially replacing the need for backup software.

Then things get more complicated. “On one level, they do it all the same: they’re all looking for redundant blocks of data and they’re trying to get rid of them,” says Preston. “But they all use a different way to accomplish that, and they absolutely will have differences in aggregate performance and single stream performance on backups and then restore performance of recent data and older data.”
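The common mechanism Preston describes — find redundant blocks, keep only one copy — can be sketched in a few lines of Python. This is a simplified illustration of content-hash de-duplication, not any vendor's actual algorithm:

```python
import hashlib

def dedupe(data: bytes, block_size: int = 4096):
    """Split data into fixed-size blocks, store each unique block once
    (keyed by its SHA-256 hash), and keep an ordered list of hashes
    that can reconstruct the original stream."""
    store = {}   # hash -> block contents; duplicates collapse to one entry
    recipe = []  # ordered hashes to rebuild the data
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # redundant blocks are dropped here
        recipe.append(digest)
    return store, recipe

def restore(store, recipe):
    """Rebuild the original stream from the unique blocks."""
    return b"".join(store[h] for h in recipe)

# A backup stream full of repeated blocks shrinks dramatically:
data = b"A" * 4096 * 9 + b"B" * 4096  # ten blocks, only two unique
store, recipe = dedupe(data)
ratio = len(data) / sum(len(b) for b in store.values())
print(f"dedupe ratio: {ratio:.0f} to 1")  # -> dedupe ratio: 5 to 1
```

The vendor differences Preston mentions live in the parts this sketch glosses over: whether blocks are fixed or variable length, where the hashing happens (source or target), and how the block store is laid out, which is what drives the differences in backup and restore performance.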

In general, though, Preston sees the target players, the VTL vendors, as good options for data centers: their solutions are designed for high performance, they reduce the amount of disk space needed to store data, and most fit easily into existing infrastructure, causing few (if any) disruptions to operations (see Getting a Grip on Bottomless E-Mail).

However, because the cost of installing a VTL solution at an enterprise with many remote offices can be expensive, with the cheapest dedupe VTLs costing $20,000 per location (not including backup software, another cost), the “replacement” backup software solutions can ultimately be more cost-effective for those companies with remote offices. That’s why, for example, the Virginia Department of Motor Vehicles, which has 73 remote offices, went with Avamar’s Axion solution, which helped the DMV eliminate its need for tape backup servers and libraries at its remote sites.

That said, de-duplication backup software solutions don’t work for everything. “If you’re going to try to back up a 50 terabyte Oracle database with any of these data dedupe backup software products, forget it,” says Preston. “The restore performance will not be acceptable to most environments for really large individual [say 20 or 30 terabyte] servers.”

If you are a large enterprise with remote offices and a central data center to back up, you probably need both types of solutions, says Preston, who doesn’t see that as such a bad thing.

“If I was a NetBackup shop and I really liked Avamar, which is an EMC product, I would have no problem running NetBackup for my data center and Avamar for my remote offices and having two completely separate products, because the remote office is such an incredible pain … that I will do anything to solve that problem,” says Preston.

Migrating from Tape to Disk

De-duplication solutions are also helping organizations make the move from tape to disk by shrinking the amount of disk space backups require, making disk affordable for both primary and secondary storage.

“The movement of disk into the backup infrastructure has really helped accelerate the adoption of our technology,” explains Jed Yueh, the founder of Avamar and now senior vice president of product management at EMC Avamar. “Customers today are of the mindset that they should be looking at disk and adding it to their data protection infrastructure. The reason why we really help that transition is because we dramatically reduce the amount of data you need to store on disk, thereby converting the economics of disk versus tape media for backup and recovery.”

A case in point, St. Peter’s Health Services, a 450-bed hospital in Albany, New York, had been wanting to make the move from tape-based backup to disk for a while — and finally purchased a disk-based backup solution in 2004. However, the conversion was going badly, to the point where Curt Damhof, St. Peter’s network manager, was considering going back to his tape vendor. Then Damhof came across Avamar’s Axion de-duplication solution and decided to test it out.

“It did what they said it would do,” says Damhof. “I don’t do any tape backups anymore. Everything’s on disk. And because of the efficiencies [of the Axion solution], I’ve been able to keep everything on disk. It’s affordable enough to let me do that. A lot of the other solutions I was looking at, because they didn’t do the deduping, you could write stuff to disk, but then you still had to write it to tape because there was too much data to keep everything on disk for a long period of time.”

Do Your Homework

“De-duplication provides two key ingredients that solve major enterprise IT challenges today,” says Yueh. “The first is it can really diffuse the explosion of data that’s happening when you transition from primary to secondary backup and recovery … not only within the data center, but at all remote sites, where data is growing out of control. And the second is it allows you to really transform archaic IT processes and bring them into the 21st century.”

But before you go out and plunk down tens of thousands of dollars on any de-duplication product, ask each vendor for a test drive, just as you would with any big-ticket item.

“You don’t just go, hey, BMW says they can go 150 miles per hour, I’ll just buy that,” says Preston. “If that’s important to you, you get in the car and you take it up to 150 miles per hour.” In other words, don’t believe everything the salespeople tell you.

“Customers will get thrown all sorts of crap: forward referencing, reverse referencing … 16-bit hash versus private hash versus universal hash versus … all these different things that in the end don’t matter. What matters is what kind of dedupe ratio do I get on my data, because one approach could work well for me but not well for another guy, and how much does it cost?”

You also need to check for several things: aggregate performance and single stream backup performance, for recent as well as for older data, “and restores of recent data and restores of old data, both big restores as in a whole bunch of them at once, and single restores of new data and old data and how fast that goes,” says Preston.

Not every product does everything or functions at the same level of performance for every task. So ultimately you have to decide what is really important to your organization.



Jennifer Schiff
Jennifer Schiff is a business and technology writer and a contributor to Enterprise Storage Forum. She also runs Schiff & Schiff Communications, a marketing firm focused on helping organizations better interact with their customers, employees, and partners.
