A modest proposal for data continuity
Unlike Jonathan Swift in his 1729 satire, I am going to propose something serious, although it approaches comedy when I consider the odds of it ever happening. I would like to suggest that an ANSI, ISO or IEEE committee come together and create an open standard for self-describing data. This format must encompass all of the formats that exist today in weather, medicine (which has several), geospatial, genetics and so on. I believe such a working group could meet and reach agreement across industries in fairly short order. The approach would be much like wrapping files that are already wrapped. This clearly does not solve the whole problem, with its long-term issues, but it does get us to a common agreed format. The same wrapper could also be used for any other file type, such as a JPEG.
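To make the idea of "wrapping" a little more concrete, here is a minimal sketch in Python. It is not a proposed standard. The magic bytes `SDW1`, the header field names, and the `wrap` and `unwrap` functions are all hypothetical, invented only for illustration. The sketch assumes a wrapper is a small self-describing header (here JSON) placed in front of the untouched original bytes.

```python
import hashlib
import json
import struct

# Hypothetical magic bytes; not taken from any real standard.
MAGIC = b"SDW1"


def wrap(payload: bytes, original_format: str, description: str) -> bytes:
    """Prefix the original bytes with a self-describing JSON header."""
    header = json.dumps(
        {
            "original_format": original_format,  # e.g. "JPEG"
            "description": description,          # human-readable note
            "payload_length": len(payload),
            "payload_sha256": hashlib.sha256(payload).hexdigest(),
        },
        sort_keys=True,
    ).encode("utf-8")
    # Layout: magic (4 bytes) + header length (4 bytes, big-endian) + header + payload
    return MAGIC + struct.pack(">I", len(header)) + header + payload


def unwrap(blob: bytes) -> tuple[dict, bytes]:
    """Return (header, original bytes), checking length and checksum."""
    if blob[:4] != MAGIC:
        raise ValueError("not a wrapped file")
    (header_length,) = struct.unpack(">I", blob[4:8])
    header = json.loads(blob[8:8 + header_length].decode("utf-8"))
    payload = blob[8 + header_length:]
    if len(payload) != header["payload_length"]:
        raise ValueError("payload length mismatch")
    if hashlib.sha256(payload).hexdigest() != header["payload_sha256"]:
        raise ValueError("payload checksum mismatch")
    return header, payload
```

The original file is never modified in this sketch. It is only carried inside the outer layer, so a future reader at least learns what the inner format was supposed to be.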
This proposal comes with some significant problems, not the least of which is that getting the right people in the room will be difficult at best. I do not think the issues will be technical so much as a matter of getting people to see the value beyond their own industry.
The only possible outcome is a cutover point, where old data in the archive is read when needed and any new data is written in the new format. Others will say, “Who needs a new format, as our format is open and standard?” But multiply that across a thousand file formats and you begin to grasp the extent of the problem. This does not end well for the big data analysis of the future, where people will be looking in historical data for relationships we have not even thought of. What will be lost if not all data can be read?
Back to Egyptian data loss
I am not sure why the first obelisk was so well preserved and the second one was not. There can be lots of reasons for data loss. In the case of the obelisk, sandstorms and water are just two examples. The fact of the matter is that we would not be able to read the first one had it not been for the Rosetta Stone and its three versions of the same text on the same rock. So there are two points of potential data loss for future generations trying to read our data:
- Data failure, like the second rock
- Our ability to translate the data, like the first rock
Preventing data failure, the case of the rock worn away by time, is a data protection problem, and it is reasonably simple: keep multiple copies in multiple places, record checksums, and check each copy. Translation, on the other hand, is far more difficult. Besides there being format winners and losers, formats change over time, sometimes to the point that older files quickly become unreadable. There are open formats like PDF and standard formats like JPEG 2000, but even those might change over time. And how long they remain backward-compatible is up to the group of people who control the standard.
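The protection half of the problem, multiple copies with checksums and a check on each copy, can be sketched in a few lines of Python. This is only an illustration under stated assumptions. The function names `sha256_of` and `audit_copies` are hypothetical, SHA-256 is my choice of checksum, and the copies are assumed to be ordinary files reachable by path.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def audit_copies(copies: list[Path], expected: str) -> dict[Path, bool]:
    """Map each copy to True only if it exists and matches the recorded checksum."""
    return {
        copy: copy.is_file() and sha256_of(copy) == expected
        for copy in copies
    }
```

Any copy that reports `False` is missing or damaged and can be replaced from a copy that still matches. Nothing in this check says whether anyone will still be able to interpret the bytes, which is the translation problem.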
The bottom line is that our future depends on people who might or might not be thinking about the past and the future and the importance of data preservation. What about all of the other formats, and what about formats like Microsoft Word that have had compatibility issues over time? People need to start thinking about data formats, and the time is now, given the growing size and complexity of the problem. The technology problem is not difficult, but the politics surrounding the problem are currently insurmountable. We seem headed for massive data loss unless we act.