Tips for Better Deduplication
Deduplication can now be regarded as a mature technology. It’s been around for a decade and in that time has become an integral part of backup and storage processes. But that doesn’t mean the technology has become static. Deduplication strategies have changed and the products have evolved to offer more flexibility, greater ease of use or better results.
Here are some tips for more effective deduplication:
All of the Above
Rob Emsley, Senior Director of Marketing, EMC Data Protection and Availability Division, noted that the deduplication algorithms themselves have not changed that much lately, but where and when deduplication takes place has evolved.
Go back a few years and there were competing deduplication approaches vying for dominance. Some said it was best to deduplicate at the data’s source, others argued that you should do it at the target device. Some stressed doing it inline while others insisted that deduplication should only be done after another process like backup took place (post processing).
So who was right? It turns out there are use cases for all of them. The market response to which method is best has shifted to “all of the above.” Often, several of them are combined to boost efficiency.
“You no longer see constant discussions of source versus target or inline versus post processing,” said Emsley.
The reason is that deduplication vendors have developed tools that allow for a variety of approaches, depending on the workload. EMC, for instance, has developed its Data Domain product line so deduplication can take place on a client, within the backup infrastructure or just on the deduplication storage system. That reduces the storage consumed for backups and cuts down the amount of backup data that needs to move across the network.
Deduplication used to be a little clunky. It was an add-in to other actions – perhaps a new element interjected into the backup process. These days, it is either embedded into backup software, or appliances are available that are largely plug and play.
This has given rise to the Purpose Built Backup Appliance (PBBA) market. In 2013, the PBBA market produced revenues of $3.1 billion, according to IDC.
“PBBAs continue to be the preferred solution for deploying deduplication storage systems,” said Emsley.
It’s often noted that deduplication ratios can vary. The ratio compares the amount of data you started with to the amount left once all the duplicates are removed. One trick to improve the ratio is to vary the block size. A given type of data may have a relatively poor dedupe ratio with one block size but an excellent ratio when a smaller block size is used.
“Select a system that deduplicates in variable versus fixed block sizes to maximize deduplication ratios,” said Emsley.
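The effect of block size on the ratio can be illustrated with a minimal sketch. It uses plain fixed-size blocks at two different sizes, not the variable-length chunking Emsley refers to, and it is not any vendor's algorithm. The function name `dedupe_ratio` and the synthetic data layout (a shared 4 KB record followed by a unique 4 KB record, repeated 100 times) are assumptions made up for illustration.

```python
import hashlib


def dedupe_ratio(data: bytes, block_size: int) -> float:
    """Return original size divided by stored size for fixed-size block dedupe."""
    unique_blocks = {}  # block hash -> block length, one entry per distinct block
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        unique_blocks.setdefault(hashlib.sha256(block).digest(), len(block))
    stored = sum(unique_blocks.values())
    return len(data) / stored


# Hypothetical data: a 4 KB record that repeats, interleaved with 4 KB of
# content that is different every time.
shared = bytes(range(256)) * 16  # 4096 bytes, identical in every pair


def unique_record(n: int) -> bytes:
    # 32-byte digest repeated 128 times = 4096 bytes, distinct for each n
    return hashlib.sha256(str(n).encode()).digest() * 128


data = b"".join(shared + unique_record(n) for n in range(100))

for size in (8192, 4096):
    print(f"block size {size}: ratio {dedupe_ratio(data, size):.2f}")
```

With 8 KB blocks, every block contains some unique content, so nothing is eliminated and the ratio stays at 1.0. With 4 KB blocks, the shared record lines up with a block boundary and is stored only once, which brings the ratio to roughly 1.98 on this contrived data.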
Bill Andrews, CEO of ExaGrid, takes things a step further. He sees a challenge in implementing variable blocks: the “hash table” that keeps track of deduplication blocks can grow to be very large. As an example, at a block size of 8 KB, a 10 TB backup environment can have more than 1 billion entries in the hash table, since 10 TB divided by 8 KB is about 1.25 billion blocks. This can lead to additional consumption of controllers and disk shelves.
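The arithmetic behind that example is easy to check. The sketch below counts the blocks in a backup set, which is the upper bound on hash table entries if every block were unique. The helper name `index_entries` and the 40 bytes per entry (a 32-byte hash plus an 8-byte location) are assumptions for illustration, not figures from ExaGrid or any product.

```python
def index_entries(backup_bytes: int, block_bytes: int) -> int:
    """Number of blocks in the backup set, rounded up (ceiling division)."""
    return -(-backup_bytes // block_bytes)


TB = 10**12  # decimal terabyte
KB = 10**3   # decimal kilobyte

entries = index_entries(10 * TB, 8 * KB)

# Assumed entry size: 32-byte hash + 8-byte block location (hypothetical).
BYTES_PER_ENTRY = 40
index_bytes = entries * BYTES_PER_ENTRY

print(f"{entries:,} entries, about {index_bytes / 10**9:.0f} GB of index")
```

With decimal units this gives 1.25 billion entries. With binary units (10 TiB and 8 KiB) the count is about 1.34 billion. Either way the figure is on the order of the 1 billion cited above, and under the assumed entry size the index alone would run to tens of gigabytes.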
Andrews recommended zone-level deduplication, where larger zones are compared to find the changed bytes within the zones. This approach allows for the use of virtually any backup application and, as data volumes increase, the backup window stays fixed in length, which can eliminate the need for forklift upgrades.
Virtualization and deduplication technologies arrived on the landscape at a similar time and both quickly gained user acceptance. Casey Burns, Product Marketing, DXi, Virtual and Cloud Solutions, Quantum, saw a relationship between the two – the proliferation of virtual machines (VMs) leading to an even greater need for deduplication. Therefore, in most cases, it is vital to implement deduplication technology that is fully equipped to deal with virtual environments.
“We’ve had customers say that without deduplication, the savings they were realizing from virtualization were being eroded by the explosive growth of VMs,” said Burns.
Separate Out Data Types
However, not all data is amenable to deduplication. If you try to deduplicate everything, a) your overall ratios could suffer, b) you could slow certain processes down and c) you could drive up overall costs.
“Deduplication ratios often fell short of user expectations, because not all workloads are deduplication-friendly,” said Gartner analyst Pushan Rinnen.
“Knowing what data is well-suited to deduplication and what data isn’t makes a world of difference to the efficiency of your data workflow,” said Burns. “As more corporate data consists of unstructured content – such as video or satellite imagery – it is increasingly important that data is separated from the dedupe data flow.”
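One simple way to picture that separation is a routing step ahead of the backup job. The sketch below is only an illustration: the function name `route` and the extension list are hypothetical stand-ins for video and imagery content, and a real environment would likely classify data by workload or policy, not by file name alone.

```python
from pathlib import PurePath

# Hypothetical extensions standing in for video and imagery content that is
# unlikely to deduplicate well. Adjust to the data actually in the environment.
BYPASS_EXTENSIONS = {".mp4", ".mov", ".mkv", ".jpg", ".jp2"}


def route(path: str) -> str:
    """Return 'bypass' for files kept out of the dedupe flow, else 'dedupe'."""
    suffix = PurePath(path).suffix.lower()
    return "bypass" if suffix in BYPASS_EXTENSIONS else "dedupe"


for name in ("archive/flight_042.MP4", "db/orders.bak", "vm/web01.vmdk"):
    print(name, "->", route(name))
```

Files routed to the bypass path would go to ordinary storage, so they neither drag down the overall ratio nor consume processing time in the deduplication system.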