Ensuring the Future of Data Archiving
The word "archive" has been thrown around for years and means lots of things to lots of different people. Hardware vendors offer various archive platforms, including tape, disk and optical, and some claim NAND flash will be used for archive eventually too. I could talk about the advantages and disadvantages of various hardware mediums for archive, but when the rubber meets the road, it is the software that is going to ensure that your data can be accessed after it is archived.
Software is needed to manage files and objects. Software is needed to write and read the files and objects to archival storage, and software is definitely needed for format migration. So what are the software requirements that will allow people to archive to whatever hardware they choose (or whatever hardware the market chooses for them)?
A complete examination requires looking at everything from interfaces to archive formats. No matter what anyone tells you, there is data that does not need to be on primary storage, and with the exponential growth of data, some of which might not be used for years, there is a need for archiving data—and for making sure that you’ll be able to access it and use it long after formats and interfaces have changed.
The archive interface of choice even five years ago was NFS or FTP, and in the HPC world it was GridFTP and Aspera (now an IBM product). Today this is no longer the case with REST, S3 and other interfaces becoming popular for archiving.
What is missing at the interface is the creation of a collision-proof hash for a file as part of the movement to the archive. The hash is needed to ensure the reliability of the data in case there is a silent corruption over the years, and it is also needed to prove that the file has not been tampered with. This collision-proof hash needs to be considered in context of how long the archive is going to last or how long before you want to dedicate resources to create a new hash. So you need to ask the question, will a SHA-256 hash, for example, be good enough in 10, 20 or 50 years? Do you want to pay the price for re-computing hashes with the likely CPU improvements in 10, 20 or 50 years?
Interface software needs to be able to do what you want it to do. If you want to spend the money and time upfront and use SHA-512 instead of SHA-256 or SHA-128, options should exist in the interface software to allow this to be done. Adding this functionality in NFS is not feasible, nor is it feasible to add it to FTP given that it would take a change by the Internet Engineering Task Force (IETF). These types of features could be added to applications such as GridFTP and Aspera, but even if they are added, these applications are not part of an archive software stack. S3 and REST could also add these features fairly easily, and they could easily develop interfaces to pass the hash to the archive software stack.
Let’s assume that the interface software to the archive has done its job and created a collision-proof hash that has been passed to the archive system. The software now must validate the hash for the file or object. (As we move to the future, it is likely going to be objects, so that is the term I will use from now on.)
After the hash is validated, the software then needs to confirm the validation so the object does not have to be retransmitted, and then store the object on the storage appropriate for the object and the software. It would be very nice to have administrator- or user-definable information about the length of time the object will be kept and the importance of the file, defining the reliability requirements over time and determining the copy count based on the reliability of the media being used.
The archive software needs to be able to search the archive efficiently based on what is searchable in the objects (geolocation, user, group, project, date, etc.) and the security requirements for each of the objects. We have all heard of the huge number of data breaches over the last few years, and the whole issue of archive security, starting with per-object security, is going to be critical over the long haul. Security needs to be built in up front rather than an afterthought or an add-on.
The archive software needs to have features such as format conversion, for example, for converting PDF 1.3 from around 2000 to PDF 1.7 which is in use today. The whole issue of format conversion is a touchy subject, as you will need to create a new hash if you convert to a format and the original file has now been changed. In the archivist world, especially large libraries and preservation archives, this is a big deal, as some have mandates to keep the data in the original bit-for-bit format. At some point in the next few years, this is going to have to be dealt with, but for now let’s assume that format conversion can take place as I have described.