Many computing operations throw off copies in volume: prime offenders include backup, analytics, snapshots, cloning, and test/dev. Not only do multiple processes each generate their own copies; each copy is proprietary to the application that created it, so no other process can reuse it. The result is a storage landscape littered with duplicate data that cannot be leveraged. Not even cloud users get away scot-free: they still pay for the storage space and bandwidth, and those copies remain exclusive to the process that created them.
For decades this siloed, crazy-quilt environment was business as usual because there was little anyone could do about it. Data protection, analytics, and testing systems all generated their own copies of data because they had to: it was the only way any of those processes could work.
This challenging state of affairs spurred Actifio to launch data copy management in 2009. The question they asked was: what if a single product could eliminate duplicate data across multiple processes by providing a single golden copy of that data for all of them? What if a single product could capture data copies from multiple applications, store a single copy of that data, and then virtualize it wherever it was needed by data protection and business applications?
This article will look at what data copy management means today and how various vendors are implementing it.
The Problem of Copies
Data falls into two major classes: production data and copies. That sounds simple enough, but the “copies” side is complicated and contains a number of subsets:
1. Analysis and data mining. Analytics applications often copy production data into worksets for analysis and data mining. Although subsequent analysis may replace a workset copy with a new one, historical trending depends on older copies. And as production data to be analyzed grows, so does the number of worksets.
2. Backup. Backup applications help control backup volume sizes with incremental backups, dedupe, and data retention management tools. Even so, that leaves a huge amount of proprietary backup data stored on on-premises disk and tape, in off-site tape vaults, and in the cloud. No other process can make use of this dizzying amount of stored data.
3. Snapshots. Snapshots are critical to point-in-time recovery. The images also take up a lot of copy space, which only grows over time. Multiply snapshot images by hundreds and thousands of virtual machines and you have a lot of copies.
4. Virtual cloning. Cloning creates a new VM from an existing one. The cloning process is a critical time saver in virtual environments, but over time clones grow and take up more and more storage space and processing cycles; it is far simpler to spin up a new machine than to locate an under-utilized one and spin it down.
5. Test and development (test/dev). Test/dev teams work with copies of production data. As the development cycle continues, copies get larger and multiple copies enable the developers to test different options and cases.
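To see why snapshot copies in particular keep growing, consider a toy copy-on-write model. The sketch below is purely illustrative (real storage systems track physical blocks, not Python dicts): each snapshot holds only blocks that were later overwritten, so every additional snapshot pins more old data, and the footprint multiplies across hundreds or thousands of VMs.

```python
# Toy copy-on-write snapshot model; illustrative only, not how any
# particular storage array implements snapshots.

class Volume:
    def __init__(self):
        self.blocks = {}        # block_id -> current data
        self.snapshots = []     # each snapshot: block_id -> preserved data

    def snapshot(self):
        # A new snapshot starts empty; it fills in only when blocks
        # are later overwritten (copy-on-write).
        self.snapshots.append({})

    def write(self, block_id, data):
        # Preserve the old contents in every snapshot that has not
        # yet captured this block.
        old = self.blocks.get(block_id)
        if old is not None:
            for snap in self.snapshots:
                snap.setdefault(block_id, old)
        self.blocks[block_id] = data

    def copy_footprint(self):
        # Extra blocks held purely on behalf of snapshots.
        return sum(len(s) for s in self.snapshots)

vol = Volume()
vol.write("b1", "v1")
vol.snapshot()
vol.write("b1", "v2")   # old "v1" is now pinned by the first snapshot
vol.snapshot()
vol.write("b1", "v3")   # "v2" pinned by the second snapshot
print(vol.copy_footprint())   # 2 preserved blocks
```

Every overwrite after a snapshot adds to the footprint, which is why snapshot space "only grows over time" even though each individual snapshot is cheap at creation.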
Copy management converges all of these separate copy processes into a single source that feeds them all. If the same copies can feed some or all of these subsets, the result is an extremely efficient, and much less costly, operation.
Actifio is the creator and reigning champion of the copy management class. Actifio Virtual Data Pipeline (VDP) sniffs out duplicate pieces of copy data, stores a single copy, and makes virtual copies available to multiple processes in their native formats. For example, VDP recognizes duplicate data shared by separate replication and backup processes: it retains a single copy of that data and virtualizes it as needed in each process's native format. Actifio consolidates copies for backup, snapshots, disaster recovery, and test/dev, eliminating separate products for each. In-band deployment depends on IBM SAN Volume Controller (the well-known SVC) over Fibre Channel; the out-of-band option operates over iSCSI or application-specific APIs.
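The single-golden-copy idea can be sketched as a content-addressed store: each unique chunk of data is kept exactly once, and every consumer (backup, test/dev, analytics) holds only lightweight references that can be rehydrated on demand. This is a minimal conceptual sketch, not Actifio's actual implementation; the class and chunk size are invented for illustration.

```python
import hashlib

class GoldenCopyStore:
    """Toy content-addressed store: each unique chunk is stored once,
    and consumers hold only references (a 'virtual copy')."""

    def __init__(self):
        self.chunks = {}    # sha256 digest -> bytes (the single golden copy)

    def ingest(self, data, chunk_size=4):
        refs = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)   # store each chunk once
            refs.append(digest)
        return refs         # lightweight references, not a physical copy

    def materialize(self, refs):
        # Rehydrate a virtual copy for whichever process needs it.
        return b"".join(self.chunks[r] for r in refs)

store = GoldenCopyStore()
backup_refs = store.ingest(b"ABCDABCD")    # a "backup" of some data
testdev_refs = store.ingest(b"ABCDEFGH")   # a "test/dev" copy sharing a chunk
print(len(store.chunks))                   # 2 unique chunks for 4 logical ones
```

Four logical chunks across the two consumers collapse into two stored chunks, yet each consumer can still materialize its data in full.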
Arguably Actifio has both defined the class and fulfilled it. But there are other vendors who are working to manage copies across multiple processes, vendors and/or applications (as opposed to simply, say, deduping data within a single backup process).
Unlike Actifio’s single-golden-copy method, Catalogic Software’s Copy Data Management platform catalogs file copies and snapshots generated by NetApp and VMware. It federates search across local, remote, and cloud storage, including data located in vaults and mirrors. Moreover, it maintains updatable snapshot copies that can be replicated across different data protection applications; these writeable snapshots are also usable in test/dev and analytics environments.
Yet another incarnation of data copy management is HDS’ Hitachi Data Instance Manager (HDIM). This is a limited solution for Windows environments that manages backup and CDP copies across multiple storage systems. Integration with Hitachi Content Platform and Microsoft Azure adds copy management-based archiving for files and email.
EMC does not have a data copy management product in its arsenal but has reorganized to support the functionality. In 2013 EMC formed a new Data Protection and Availability Group made up of VPLEX, RecoverPoint, Networker, Avamar, and Data Domain. Here is what EMC’s thoughtful Dr. Stephen Manley said about a shift from backup to data protection: “I want to do more with my data… I want to be able to use my data for disaster recovery… I want to boot it up instantly… and for that to happen I need to keep the data in its native, original format. I can’t be putting it into proprietary, lock-in formats… that way I can do backup, disaster recovery, and archive, because I’ve got the data in a form that’s useable.” We will see whether EMC develops a product specific to data copy management or builds on existing RecoverPoint management services.
IT has lived with data silos for many years, and some organizations will live with them for many more. Change is hard to introduce into any computing infrastructure, and simple environments will not need copy data management any time soon. The more complex the infrastructure, however, the harder and more costly it is to manage massive copy generation. How costly? Data copies of whatever stripe consume as much as 65% of storage system capacity, which adds up to billions of dollars spent simply to store copies.
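The 65% figure makes the cost easy to estimate for any given environment. The capacity and cost-per-TB numbers below are hypothetical assumptions for illustration; only the 65% copy fraction comes from the article.

```python
# Back-of-the-envelope cost of copy sprawl. Capacity and $/TB/year
# are assumed for illustration; the 65% copy fraction is the figure
# cited in the article.

total_capacity_tb = 1000      # hypothetical enterprise storage footprint
copy_fraction = 0.65          # share of capacity consumed by copies
cost_per_tb_year = 500.0      # assumed fully loaded cost, $/TB/year

copy_tb = total_capacity_tb * copy_fraction
annual_copy_cost = copy_tb * cost_per_tb_year
print(f"{copy_tb:.0f} TB of copies costs about ${annual_copy_cost:,.0f}/year")
```

Even at this modest scale, copies alone account for hundreds of thousands of dollars per year; across thousands of enterprises, the article's "billions of dollars" claim follows directly.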
This makes data copy management a real advantage in complex computing infrastructure, not only for data protection but also for business groups that depend on production data copies. By managing multiple processes under a single interface, IT saves on storage costs and resource overhead and also benefits users by efficiently virtualizing production copies for analytics and test/dev.