Data deduplication first captured the imagination of the public almost 20 years ago when Data Domain (now Dell/EMC) launched its “tape is dead” campaign and released some of the earliest deduplication appliances.
Data deduplication maximizes storage utilization, allowing organizations to retain more backup data on disk for longer periods of time. It raises the efficiency of disk-based backup, lowers storage costs, and changes the way data is protected. It does this by comparing new data with existing data and eliminating redundancies: only unique blocks are stored or transferred, which also reduces replication bandwidth.
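As a rough illustration of that principle (the fixed block size and SHA-256 hashing here are illustrative assumptions, not any vendor’s design), a minimal Python sketch of comparing new data against existing data and transferring only unique blocks might look like this:

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size; many products use variable-size chunking

def blocks_to_transfer(data: bytes, known_hashes: set) -> list:
    """Compare new data against what the target already holds and return
    only the unique blocks that actually need to be sent."""
    unique = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in known_hashes:   # redundant blocks are never transferred
            known_hashes.add(digest)
            unique.append(block)
    return unique

known = set()
first = blocks_to_transfer(b"A" * 8192 + b"B" * 4096, known)
second = blocks_to_transfer(b"A" * 8192 + b"B" * 4096, known)  # unchanged repeat backup
print(len(first), len(second))  # -> 2 0: the repeat backup sends nothing
```

The second, unchanged backup transfers zero blocks, which is where the storage and bandwidth savings come from.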
Here are five top trends in deduplication:
1. Software Rather than Appliances
The standard approach for many years was to buy two deduplication appliances: one served as the backup target and replicated its deduplicated data to the other box.
This solution certainly has value, but it can prove to be clunky as data volumes rise. When that happens, restores can be slow, and costs can spiral upwards.
Some deduplication products are now software-only, avoiding the need for pre-built appliances; the software can run on virtual machines. Hardware solutions still exist, too, and have valid applications.
2. Flexibility
An argument raged for years as to where and how best to accomplish data deduplication. Some said it had to be done at the source, and some said it was better to do it at the target. Similarly, some insisted it should be done inline, while others said it should be done offline. There are tradeoffs for each approach.
Modern deduplication systems have evolved to allow the user to select the type of deduplication that best meets organizational requirements. Products are available that offer choices such as inline, concurrent, or post-process deduplication.
“If your focus is on the storage or backup, the locational preference for compression and deduplication will be at the target,” said Greg Schulz, an analyst at StorageIO Group.
Inline deduplication helps minimize storage requirements and tends to be most suitable for smaller storage configurations as well as replication environments. Concurrent deduplication doesn’t have to wait for backup jobs to complete. Post-process, or offline, deduplication decouples dedupe from the backup process entirely: backup data is written to temporary disk space prior to deduplication. This is a good way to keep backup windows short, as backup performance is unaffected by deduplication workloads. The sketch below makes the contrast concrete.
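Here is a minimal Python sketch of the difference (the staging list and helper names are illustrative assumptions, not a product’s API): inline deduplication reduces data before it lands in the store, while post-process writes raw backups to temporary staging and deduplicates in a later pass.

```python
import hashlib

def _key(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

def inline_backup(blocks, store):
    """Inline: deduplicate before data lands on disk, minimizing space from the start."""
    for block in blocks:
        store.setdefault(_key(block), block)

def post_process_backup(blocks, staging):
    """Post-process: write raw data to temporary staging so the backup job
    finishes quickly, without waiting on deduplication."""
    staging.extend(blocks)

def deduplicate_staging(staging, store):
    """Runs later, decoupled from the backup window."""
    while staging:
        block = staging.pop()
        store.setdefault(_key(block), block)

store, staging = {}, []
post_process_backup([b"alpha", b"beta", b"alpha"], staging)
deduplicate_staging(staging, store)
print(len(store))  # -> 2: duplicates removed after the backup completed
```

The tradeoff is visible in the code: inline saves space immediately but sits in the write path, while post-process keeps backups fast at the cost of temporary staging capacity.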
But as data sets grow larger, storage admins are challenged with silos that make it difficult to store and manage data.
One approach pairs a Reed-Solomon data layout technique with a data lake to give customers petabyte-scale capacity along with high-speed inline deduplication and compression that eliminate duplicates. Where more speed is needed, inline deduplication may be the answer.
“Inline deduplication delivers great benefits to reduce the space usage on disk,” said Brian Henderson, director of product marketing for unstructured data storage at Dell Technologies.
“The combination of inline compression, inline deduplication, and post-process deduplication on a wider pool of storage helps lower storage costs, while increasing storage efficiency.”
3. Global Deduplication
As noted, deduplication used to be about transferring data between two boxes and comparing one to the other. Modern systems can now perform the same function globally.
Falconstor, for example, provides high-performance backup and recovery that includes deduplication to shorten backup windows, optimize capacity, reduce storage costs, and minimize WAN requirements. Its StorSafe product receives backups from many sources, then chunks, hashes, and reduces them, so it can store them cheaply on-premises or in cloud object storage.
“Only globally unique blocks of data are replicated,” said Chris Cummings, VP of marketing at FalconStor.
“Global deduplication ensures that redundant data from remote offices is eliminated prior to replicating to the central repository, minimizing space requirements.”
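As a simplified sketch of the idea (not FalconStor’s implementation; the central index and function names are assumptions for illustration), global deduplication amounts to checking each block against a hash index shared by every source before anything crosses the WAN:

```python
import hashlib

central_index = set()  # hashes of blocks already held in the central repository

def replicate_from_site(blocks):
    """Send only globally unique blocks from a remote office to the central repository."""
    to_send = []
    for block in blocks:
        key = hashlib.sha256(block).hexdigest()
        if key not in central_index:
            central_index.add(key)
            to_send.append(block)   # everything else already exists centrally
    return to_send

site_a = [b"invoice", b"payroll"]
site_b = [b"payroll", b"contract"]  # "payroll" also exists at the second site
print(len(replicate_from_site(site_a)))  # -> 2
print(len(replicate_from_site(site_b)))  # -> 1: only the new block crosses the WAN
```

Because the index spans every source, data that is redundant across remote offices is transferred and stored only once.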
4. Sub-File Deduplication
Deduplication can take place at the file and the sub-file level. Some systems use single-instance storage (SIS), which compares complete files; if a file receives even a minor modification, the entire file has to be stored again.
Sub-file, or block-based, deduplication raises efficiency: data is broken into sub-blocks, each of which is assigned an identification key, typically a hash. If two blocks produce the same hash key, they are treated as identical.
Once it is determined that a block of data already exists in the deduplication repository, the block is replaced with a pointer linking the new sub-block to the existing block in the repository.
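A rough Python sketch of that mechanism (fixed-size sub-blocks and SHA-256 keys are illustrative choices, not any specific product’s design) shows how a file becomes a manifest of pointers into a shared repository, so a small change only adds the modified block rather than a whole new copy of the file:

```python
import hashlib

BLOCK = 4  # tiny block size, chosen only to keep the example readable

def store_file(data: bytes, repo: dict) -> list:
    """Store a file as a manifest of pointers (hashes) into the block repository."""
    manifest = []
    for i in range(0, len(data), BLOCK):
        sub = data[i:i + BLOCK]
        key = hashlib.sha256(sub).hexdigest()
        repo.setdefault(key, sub)  # blocks already in the repository are not stored again
        manifest.append(key)       # the file itself is just a list of pointers
    return manifest

repo = {}
v1 = store_file(b"AAAABBBBCCCC", repo)
v2 = store_file(b"AAAABBBXCCCC", repo)  # one sub-block changed
print(len(repo))  # -> 4: only the modified block was added, unlike whole-file SIS
```

With SIS, the second version would have been stored in full; with sub-file deduplication, only the changed block consumes new space.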
5. Capacity-Based Pricing Should Be Avoided
Some current services base their pricing on data volumes combined with modeled deduplication rates. Because the data reduction achieved by eliminating duplicate files can reach ratios of 20:1 or higher in some cases, this pricing model can become expensive and hard to predict.
“Anyone attempting to measure data volumes and deduplication modeling for each customer will tie up IT resources and eat into profits,” said Cummings with FalconStor.
“The best approach is to find a fixed low price per month.”