The word “archive” has been thrown around for years and means lots of things to lots of different people. Hardware vendors offer various archive platforms, including tape, disk and optical, and some claim NAND flash will be used for archive eventually too. I could talk about the advantages and disadvantages of various hardware media for archive, but when the rubber meets the road, it is the software that is going to ensure that your data can be accessed after it is archived.
Software is needed to manage files and objects. Software is needed to write and read the files and objects to archival storage, and software is definitely needed for format migration. So what are the software requirements that will allow people to archive to whatever hardware they choose (or whatever hardware the market chooses for them)?
A complete examination requires looking at everything from interfaces to archive formats. No matter what anyone tells you, there is data that does not need to be on primary storage, and with the exponential growth of data, some of which might not be used for years, there is a need for archiving data—and for making sure that you’ll be able to access it and use it long after formats and interfaces have changed.
Interfaces
The archive interface of choice even five years ago was NFS or FTP, and in the HPC world it was GridFTP and Aspera (now an IBM product). Today this is no longer the case, with REST, S3 and other interfaces becoming popular for archiving.
What is missing at the interface is the creation of a collision-proof hash for a file as part of the movement to the archive. The hash is needed to ensure the reliability of the data in case there is silent corruption over the years, and it is also needed to prove that the file has not been tampered with. This collision-proof hash needs to be considered in the context of how long the archive is going to last, or how long before you want to dedicate resources to create a new hash. So you need to ask the question: will a SHA-256 hash, for example, be good enough in 10, 20 or 50 years? Do you want to pay the price for re-computing hashes with the likely CPU improvements in 10, 20 or 50 years?
Interface software needs to be able to do what you want it to do. If you want to spend the money and time upfront and use SHA-512 rather than SHA-256, options should exist in the interface software to allow this to be done. Adding this functionality to NFS is not feasible, nor is it feasible to add it to FTP, given that it would take a change by the Internet Engineering Task Force (IETF). These types of features could be added to applications such as GridFTP and Aspera, but even if they are added, these applications are not part of an archive software stack. S3 and REST interfaces could add these features fairly easily, and they could just as easily pass the hash along to the archive software stack.
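Here is a minimal sketch of what a hash-aware interface could do at ingest time, assuming Python's standard hashlib; the function name, the metadata keys and the put_object() call are illustrative, not part of any existing interface standard.

```python
# Sketch of a hash-aware archive interface: the caller picks the digest
# algorithm up front (e.g. SHA-256 today, SHA-512 if you want more headroom),
# and the digest travels with the object into the archive stack.
# All names here are illustrative; no existing interface standard defines them.
import hashlib

def hash_for_archive(path: str, algorithm: str = "sha256") -> str:
    """Compute a collision-resistant digest of a file, streaming in chunks."""
    h = hashlib.new(algorithm)          # "sha256", "sha512", ...
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The interface would pass the digest along with the object, for example:
# put_object(path, metadata={"digest": hash_for_archive(path, "sha512"),
#                            "digest-algorithm": "sha512"})
```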
Archive Software
Let’s assume that the interface software to the archive has done its job and created a collision-proof hash that has been passed to the archive system. The software now must validate the hash for the file or object. (As we move to the future, it is likely going to be objects, so that is the term I will use from now on.)
After the hash is validated, the software needs to acknowledge the validation so the object does not have to be retransmitted, and then store the object on the storage appropriate for the object and the software. It would be very useful to have administrator- or user-definable information about how long the object will be kept and how important it is, defining the reliability requirements over time and determining the copy count based on the reliability of the media being used.
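As a rough illustration of that ingest step, the sketch below re-hashes the received object, compares it with the digest supplied by the interface, and records the retention period and a copy count derived from an assumed importance-to-copies policy. The ArchiveRecord structure and the policy table are hypothetical.

```python
# Sketch of the ingest step: re-hash the received object, compare against the
# digest supplied by the interface, then record retention and copy-count
# policy with the object. ArchiveRecord and the policy table are illustrative.
import hashlib
from dataclasses import dataclass

COPIES_BY_IMPORTANCE = {"low": 1, "normal": 2, "critical": 3}  # assumed policy

@dataclass
class ArchiveRecord:
    object_id: str
    digest: str
    digest_algorithm: str
    retention_years: int
    copy_count: int

def ingest(object_id: str, data: bytes, claimed_digest: str, algorithm: str,
           retention_years: int, importance: str) -> ArchiveRecord:
    actual = hashlib.new(algorithm, data).hexdigest()
    if actual != claimed_digest:
        # Tell the sender so the object can be retransmitted.
        raise ValueError(f"digest mismatch for {object_id}: retransmit required")
    return ArchiveRecord(object_id, actual, algorithm,
                         retention_years, COPIES_BY_IMPORTANCE[importance])
```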
The archive software needs to be able to search the archive efficiently based on what is searchable in the objects (geolocation, user, group, project, date, etc.) and the security requirements for each of the objects. We have all heard of the huge number of data breaches over the last few years, and the whole issue of archive security, starting with per-object security, is going to be critical over the long haul. Security needs to be built in up front rather than bolted on as an afterthought.
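To make the search-plus-security idea concrete, here is a toy example that filters objects on indexed metadata fields and enforces per-object access control before returning any result. The catalog layout and field names are invented for illustration; a real archive would back this with a database and a richer access-control model.

```python
# Illustrative only: a toy metadata search that filters on indexed fields
# (project, user, date, geolocation) and enforces per-object access control
# before returning results.
from datetime import date

catalog = [
    {"object_id": "obj-001", "project": "climate", "owner": "alice",
     "created": date(2015, 3, 1), "readers": {"alice", "bob"}},
    {"object_id": "obj-002", "project": "genome", "owner": "carol",
     "created": date(2016, 7, 9), "readers": {"carol"}},
]

def search(requesting_user: str, **criteria):
    for entry in catalog:
        if any(entry.get(k) != v for k, v in criteria.items()):
            continue
        # Per-object security check: never return objects the user cannot read.
        if requesting_user not in entry["readers"]:
            continue
        yield entry["object_id"]

print(list(search("alice", project="climate")))  # ['obj-001']
print(list(search("bob", project="genome")))     # [] -- blocked by access control
```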
The archive software needs to have features such as format conversion, for example converting PDF 1.3 from around 2000 to PDF 1.7, which is in use today. The whole issue of format conversion is a touchy subject, as converting to a new format changes the original file, which means you will need to create a new hash. In the archivist world, especially large libraries and preservation archives, this is a big deal, as some have mandates to keep the data in the original bit-for-bit format. At some point in the next few years this is going to have to be dealt with, but for now let’s assume that format conversion can take place as I have described.
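A sketch of the bookkeeping that format migration implies follows: the converted object gets a new hash, while the original hash and format are preserved as provenance so the chain of custody survives the conversion. The convert() callable stands in for a real converter (say, PDF 1.3 to PDF 1.7) and is purely a placeholder.

```python
# Hypothetical migration bookkeeping: compute a new digest for the converted
# object and keep the previous digest and format as provenance metadata.
import hashlib
from typing import Callable

def migrate_format(data: bytes, old_record: dict,
                   convert: Callable[[bytes], bytes],
                   new_format: str) -> tuple[bytes, dict]:
    converted = convert(data)
    new_record = {
        "digest": hashlib.sha256(converted).hexdigest(),
        "format": new_format,
        "provenance": {
            "previous_digest": old_record["digest"],
            "previous_format": old_record["format"],
        },
    }
    return converted, new_record
```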
Secondary Media Formats
There are lots of media formats that have been used over the years for archive software, such as tar, gtar, LTFS (tape-based format), UDF (optical-based format) and a bunch of proprietary formats. As we move into large archives for objects, things are going to have to change in a big way. None of those media formats support having a hash integrated into the format.
Data like POSIX user and group permissions are examples of what needs to be carried into archive formats, but so are things like ACLs. One of the big problems with long-term archives is that people come and go, and they also die, so the concept of ownership rights over time must be addressed. Tar and gtar were used for disaster recovery reasons: you could always recover your data in the event the front-end system blew up. LTFS and UDF add the concept of being able to move data around, as both are basically file system mount points, so you can mount the file system and look at the metadata without reading all of the data. The problem with both (UDF to a lesser extent than LTFS) is the lack of integration with standard user access controls for ownership and usage. UDF has its own issues and is used only for optical media, so rewriting permissions is not possible without rewriting the whole file. The point is that secondary media formats are not the full solution to disaster recovery today if users, rather than applications, own files, and even if that is addressed there is still the whole issue of security.
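One hedged illustration of the gap: tar itself has no field for a hash or an ACL, but the PAX variant of the format allows arbitrary extended headers, so metadata like a digest and ACL text could be grafted on as shown below. The header keys used here are made up; no standard defines them, which is exactly the problem.

```python
# A sketch of carrying a digest and ACL text on a tar-based secondary copy via
# PAX extended headers. The "ARCHIVE.*" keys are invented for illustration.
import hashlib
import tarfile

def add_with_metadata(tar: tarfile.TarFile, path: str, acl_text: str) -> None:
    info = tar.gettarinfo(path)
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    info.pax_headers = {"ARCHIVE.digest": "sha256:" + digest,
                        "ARCHIVE.acl": acl_text}
    with open(path, "rb") as f:
        tar.addfile(info, f)

with tarfile.open("secondary_copy.tar", "w", format=tarfile.PAX_FORMAT) as tar:
    add_with_metadata(tar, "dataset.bin", "user:alice:rw-,group:project:r--")
```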
Next-Generation Archiving Requirements
In the future, requirements for archive systems are going to have to deal with the following issues:
- End-to-end data integrity: Sooner or later, the movement of data around the data center or around the world is going to have to address end-to-end integrity. This kind of integrity information is going to have to be immutable and live with the object for its life, and it is going to have to be validated at access (see the sketch after this list). We need a standards-based framework to do this and therefore a standards body to work this out.
- Security: This includes far more than UNIX user and group permissions and deals with things like the mandatory access controls that exist in SELinux. Equally important is auditing what happens with each user and each activity, including file access. All we have to do is look at the huge number of security breaches to know why this needs to be done as soon as possible.
- Format migration: How do you migrate formats as technology changes? There needs to be agreement and understanding that you cannot keep objects in the same digital format for decades, much less thousands of years. And there needs to be agreement on how objects can and should be changed and how it all relates to integrity management and security management.
- Secondary media formats: If these formats are used for disaster recovery on secondary media, then they have to support everything from data integrity to security and even potentially the provenance of the object. If these formats are going to be used to restore in the event of a disaster, then how can you trust the integrity of the data unless you have all of that information?
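A minimal sketch of what validation at access combined with auditing might look like, assuming the digest stored at ingest is kept as immutable metadata: every read re-validates the object against that digest and appends an audit record of who accessed what and whether the check passed. The audit log format and the verify_on_access() function are illustrative, not drawn from any standard.

```python
# Illustrative validate-at-access with auditing: re-hash the object, compare
# against the immutable digest recorded at ingest, and log the access.
import hashlib
import json
import time

def verify_on_access(data: bytes, record: dict, user: str,
                     audit_log_path: str = "archive_audit.log") -> bytes:
    ok = hashlib.new(record["digest_algorithm"], data).hexdigest() == record["digest"]
    with open(audit_log_path, "a") as log:
        log.write(json.dumps({
            "time": time.time(),
            "user": user,
            "object_id": record["object_id"],
            "integrity_ok": ok,
        }) + "\n")
    if not ok:
        raise IOError(f"silent corruption or tampering detected in {record['object_id']}")
    return data
```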
Archives are a different beast than transient data on file systems, and we need to start thinking of archives in a different way. It will not be long until your genome is tracked from birth to death. I am sure we do not want to have genome objects hacked or changed via silent corruption, yet this data will need to be kept for maybe a hundred or more years through a huge number of technology changes. The big problem with archiving data today is not really the media, though that too is a problem. The big problem is the software that is needed, and the standards that do not yet exist, to manage and control long-term data.