Secondary Media Formats
Archive software has used many media formats over the years, including tar, gtar, LTFS (a tape-based format), UDF (an optical-based format) and a number of proprietary formats. As we move to large archives of objects, this is going to have to change in a big way. None of these media formats supports a hash integrated into the format itself.
POSIX user and group permissions are examples of the metadata that must be carried into archive formats, along with things like ACLs. One of the big problems with long-term archives is that people come and go, and they also die, so the concept of ownership rights over time must be addressed. Tar and gtar were used for disaster recovery: you could always recover your data if the front-end system blew up. LTFS and UDF add the ability to move data around. Both are basically file system mount points, so you can mount the file system and look at the metadata without reading all of the data. The problem with both, UDF to a lesser extent than LTFS, is the lack of integration with standard user access controls for ownership and usage. UDF has its own issues: it is used only for optical media, so changing permissions is not possible without rewriting the whole file. The point is that secondary media formats are not the full solution to disaster recovery today if users, rather than applications, own files. Even if that is addressed, there is still the whole issue of security.
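To make the ownership problem concrete, here is a minimal Python sketch using the standard `tarfile` module. It writes one file into an in-memory tar archive and reads back the owner, group and mode fields stored in the member header. The file name, user name, group name and numeric IDs are all hypothetical values chosen for illustration.

```python
import io
import tarfile


def build_archive(name: str, payload: bytes) -> bytes:
    """Write one file into an in-memory tar archive with explicit POSIX metadata."""
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    info.mode = 0o640                               # rw-r-----
    info.uid, info.gid = 1001, 100                  # hypothetical numeric IDs
    info.uname, info.gname = "alice", "research"    # hypothetical names
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def read_metadata(archive: bytes) -> list[dict]:
    """Return the ownership and permission fields stored for each member."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r") as tar:
        return [
            {
                "name": m.name,
                "mode": oct(m.mode),
                "uid": m.uid,
                "gid": m.gid,
                "uname": m.uname,
                "gname": m.gname,
            }
            for m in tar.getmembers()
        ]


archive = build_archive("genome/sample.dat", b"ACGT" * 4)
records = read_metadata(archive)
print(records[0])
```

The archive faithfully records that `alice` owned the file, but nothing in these fields says who should own it once `alice` has left the organization, or whether UID 1001 will mean anything on the system that restores it decades later.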
Next-Generation Archiving Requirements
- End-to-end data integrity: Sooner or later, the movement of data around the data center or around the world is going to have to address end-to-end integrity. This kind of information is going to have to be immutable and live with the object for its life. It is also going to have to be validated at access. We need a standards-based framework to do this and therefore a standards body to work this out.
- Security: This includes far more than UNIX user and group permissions and deals with things like the mandatory access controls that exist in SELinux. Equally important is auditing what happens with each user and each activity, including file access. All we have to do is look at the huge number of security breaches to know why this needs to be done as soon as possible.
- Format migration: How do you migrate formats as technology changes? There needs to be agreement and understanding that you cannot keep objects in the same digital format for decades, much less thousands of years. And there needs to be agreement on how objects can and should be changed and how it all relates to integrity management and security management.
- Secondary media formats: If these formats are used for disaster recovery on secondary media, then they have to support everything from data integrity to security and even potentially the provenance of the object. If these formats are going to be used to restore in the event of a disaster, then how can you trust the integrity of the data unless you have all of the information?
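The end-to-end integrity requirement above can be sketched in a few lines of Python. This is an illustration only, not a proposed standard: the `ArchivedObject` type and the `ingest` and `read_verified` functions are hypothetical names, and SHA-256 is an assumed choice of hash. The digest is computed once at ingest, travels with the object in an immutable record, and is checked every time the object is read.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ArchivedObject:
    """Payload plus the digest computed when the object entered the archive."""
    payload: bytes
    algorithm: str   # stored with the object so the hash can be migrated later
    digest: str


def ingest(payload: bytes, algorithm: str = "sha256") -> ArchivedObject:
    """Compute the digest once, at the point the object is first archived."""
    digest = hashlib.new(algorithm, payload).hexdigest()
    return ArchivedObject(payload, algorithm, digest)


def read_verified(obj: ArchivedObject) -> bytes:
    """Validate at access: recompute the digest and refuse to return bad data."""
    actual = hashlib.new(obj.algorithm, obj.payload).hexdigest()
    if actual != obj.digest:
        raise ValueError(
            f"integrity check failed: expected {obj.digest}, got {actual}"
        )
    return obj.payload


obj = ingest(b"ACGTACGT")
data = read_verified(obj)  # passes, payload is unchanged

# Simulate silent corruption: the payload changes but the stored digest does not.
corrupted = replace(obj, payload=b"ACGTACGA")
try:
    read_verified(corrupted)
except ValueError as err:
    print(err)
```

Storing the algorithm name next to the digest is a deliberate choice in this sketch. Over decades the hash function itself will need to be replaced, so the record has to say which one was used.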
Archives are a different beast than transient data on file systems, and we need to start thinking of archives in a different way. It will not be long until your genome is tracked from birth to death. I am sure we do not want genome objects hacked or changed via silent corruption, yet this data will need to be kept for perhaps a hundred years or more, through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed, and the standards that do not yet exist, to manage and control long-term data.
Photo courtesy of Shutterstock.