I think it is safe to say that data deduplication technology on the backup side has now gone mainstream. At the enterprise level we estimate that almost all companies have installed some form of data deduplication technology, even if the share of backup data actually being deduplicated is closer to 40 percent. In other words, each enterprise has dedupe equipment, but not all backup data is yet being deduped. The reasons vary with each enterprise (importance of the application, geographic isolation, limited funding, and so on). Look under the covers and one finds that some of this deduplication happens at the source, but the large majority happens at a target, and more specifically, a target appliance.
Clearly, EMC (Data Domain and Avamar) rules the dedupe world today. Other players include IBM with ProtecTIER, NetApp, Quantum, Symantec with PureDisk, Sepaton, and HP (with both Sepaton-based products and its own B43xx series). Very recently, HP announced the B6200 series. ExaGrid has an offering, but it caters primarily to the SMB space. A number of other backup products have some form of dedupe built in. Similarly, several new primary storage offerings have dedupe in one form or another, but we will focus on them at another time. So by all measures the market is healthy and growing.
So what's the problem?
The problem is that there are far too many point products and too many inconsistencies between them. Ideally, we believe there should be one deduplication technology that applies to the primary storage array, the source, the backup software, the backup target appliance, and the archive. Data should be deduplicated at the earliest point possible (closest to its creation) and live out its entire life cycle in the shrunken format. It should be rehydrated to its original form only when it needs to interact with an application or is presented to a user. Moving data from one place in the organization to another should not require rehydration. This means data must be replicated to, and stored on, the remote site in the same shrunken format it started out in. Whether the deduplication is done at primary storage, at the "source" or at the "target" then becomes a matter of choice for the customer. Regardless, the storage needs are kept to a minimum.
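To make the idea concrete, here is a minimal sketch in Python of data that is deduplicated once and then replicated without ever being rehydrated. It is an illustration only, not any vendor's implementation: the `DedupeStore` class and `replicate` function are hypothetical names, the 4-byte fixed-size chunks are an assumption chosen to keep the example readable, and SHA-256 simply stands in for whatever fingerprint a real product would use.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny fixed-size chunks, for illustration only


class DedupeStore:
    """Hypothetical store that keeps each unique chunk exactly once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes
        self.recipes = {}  # object name -> ordered list of fingerprints

    def write(self, name, data):
        """Deduplicate data on the way in and record how to rebuild it."""
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # store only if not seen before
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name):
        """Rehydrate: the only place the original form is reassembled."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


def replicate(source, target, name):
    """Copy an object in deduplicated form; return chunk bytes actually sent."""
    recipe = source.recipes[name]
    sent = 0
    for fp in dict.fromkeys(recipe):      # each unique fingerprint once
        if fp not in target.chunks:       # send only what the target lacks
            target.chunks[fp] = source.chunks[fp]
            sent += len(source.chunks[fp])
    target.recipes[name] = list(recipe)
    return sent


primary = DedupeStore()
remote = DedupeStore()

primary.write("backup-mon", b"AAAABBBBAAAACCCC")
primary.write("backup-tue", b"AAAABBBBAAAADDDD")

sent_mon = replicate(primary, remote, "backup-mon")
sent_tue = replicate(primary, remote, "backup-tue")
```

In this sketch the two 16-byte backups occupy 16 bytes in total on the primary store, and the second replication sends only the one chunk the remote site has not seen. The data is rehydrated only when `read` is called, which is the point of the argument above.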
There is nothing new about this vision; we have articulated it for the past five or more years. What is different is that, as an industry, we are closer to seeing it become reality. To be sure, it is still incomplete, but the progress is excellent. Look at what HP did this week. On November 29, it announced a storage technology called StoreOnce that shows up as "source" or "target" within Data Protector or as a "target appliance (B6200)" for other backup software products. And according to HP, the same technology will be added to its primary storage offerings.
While HP may have been lagging behind others, especially EMC, it has leapfrogged everyone else in one stride. EMC has done a great job integrating the Avamar and Data Domain products during the past year, but the two formats remain incompatible by definition. Regardless, that has not held it back from being the clear leader in this space. Dell bought Ocarina, whose technology is compression and single instancing, as distinct from data deduplication, with the express intent of extending that technology to general data deduplication use cases. Ocarina was already on the path to developing a consistent dedupe technology that could apply from primary storage to archiving, and we know Dell has a publicly stated goal to deliver on this vision. We expect that IBM is also working feverishly on extending the ProtecTIER technology beyond its "target appliance" use today. Permabit has developed a dedupe technology that can be embedded at any level by an OEM.
The race is on, and I believe we are closer to achieving this vision than one may imagine. HP has certainly broken several barriers in one fell swoop, but the other majors are not far behind. EMC has had a head start in this market and remains the vendor to catch. IBM is doing well with its ProtecTIER product line. Dell may have been relying on carrying the Data Domain product from EMC, but that deal has fallen apart, so Dell needs its own offering soon.
The key takeaway is for IT to visualize its end-to-end environment and see this technology for what it is: a crucial, strategic, pervasive technology that, when implemented correctly, will have far-reaching CapEx, OpEx and competitive implications.
Arun Taneja is the founding analyst of the Taneja Group.