Data Storage: REST vs. POSIX for Archives and HSM
Here is my working hypothesis: POSIX (Portable Operating System Interface) can’t scale to meet the demands of clouds and big data. REST (Representational State Transfer) can’t yet manage and tier data the way POSIX can, but it will likely get those features in the next few years and then take off as the new data interface standard of the cloud era.
There is a lot going on in the archive world. Archives are becoming far more important, given that companies and researchers are looking back at data to gain a better understanding of our world and help predict the future.
Let’s first address the difference between an archive and hierarchical storage management (HSM). My view is that an archive is about storing data that will be needed in the future, while HSM is about managing the archive with hierarchies of storage. There is a lot of money in predicting the future, whether for commodities traders, the healthcare industry, agricultural planning or some other industry. Some HSM systems use tiers of disk, but the ones that I am talking about use tape as one of the tiers, given its high reliability and lower cost.
This chart illustrates these concepts:
But this is an article on archive interfaces, not the underlying archive. Some industries, such as the geosciences companies that collect seismic information from around the world, have known about the importance of archives for decades. All of the vendors in this market I am aware of use archives with POSIX interfaces.
The archive for a geoscience company contains its intellectual property and must be saved and kept usable for the future, because new algorithms have allowed these companies both to find oil using old survey data and to find better ways to extract oil and gas. Every industry I have worked with that has an HPC compute environment has an archive, and some of those archives hold decades of data.
Other examples include weather forecasting, aircraft simulation, simulation in the auto industry and climate modeling. Whatever the simulation environment or type of industry, all of the environments that I am aware of today access their data via a POSIX file system interface.
This interface either runs on the system directly or is accessed via NFS, CIFS, some archive-specific API and/or FTP. Today the majority of HSM software, if not all of it, uses this POSIX interface. The companies that make the software have been working in industries that need tiered storage and high reliability.
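To make the contrast concrete, here is a minimal Python sketch of what the two interfaces look like to an application. The POSIX half uses ordinary file calls on a path. The REST half only builds an HTTP request and never sends it. The endpoint `https://archive.example.com`, the object key and the helper names are all hypothetical and chosen purely for illustration; a real archive or object store defines its own layout.

```python
import os
import tempfile
import urllib.request


def posix_store(directory, name, payload):
    """Store bytes through the POSIX file interface: open, write, fsync, close."""
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to stable storage
    return path


def posix_fetch(path):
    """Read the bytes back and return them with the size reported by stat()."""
    with open(path, "rb") as f:
        data = f.read()
    return data, os.stat(path).st_size


def rest_store_request(base_url, key, payload):
    """Build (but do not send) an HTTP PUT that would store the same bytes.

    The URL layout is an assumption made for this sketch.
    """
    return urllib.request.Request(
        url=f"{base_url}/{key}",
        data=payload,
        method="PUT",
        headers={"Content-Type": "application/octet-stream"},
    )


if __name__ == "__main__":
    payload = b"example survey data"
    with tempfile.TemporaryDirectory() as d:
        path = posix_store(d, "survey.dat", payload)
        data, size = posix_fetch(path)
        print(size, data == payload)
    # Hypothetical endpoint; nothing is sent over the network here.
    req = rest_store_request("https://archive.example.com", "surveys/survey.dat", payload)
    print(req.get_method(), req.full_url)
```

The POSIX calls assume a mounted file system and a path, which is why NFS or CIFS is needed to reach the archive remotely. The REST request carries everything it needs in a URL and a verb, which is part of why it fits cloud environments so naturally.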
Now, we know there is a company, which shall remain nameless, that has said for over two decades that tape is dead. But tape is not dead yet, and I believe it is actually making a comeback for POSIX archives, given the cost differences. Some will say that tape is dead based on the Santa Clara Group’s reporting on the LTO market, which has seen some significant sales reductions. Sadly, the Santa Clara Group does not and cannot report on enterprise tape, because the enterprise tape vendors do not report sales.
Enterprise tape has greater reliability, higher performance and greater density than LTO tape, and in some cases it costs about the same. The increased density means you need fewer cartridge slots in the libraries, and the performance improvement means you need fewer tape drives.
The point I will try to make is that the interface to the archive is going to change dramatically over the next five years, from POSIX to REST. And the people who are developing REST interfaces are going to have to tier storage, given the cost and density improvements. Here are some of the reasons:
1. Cloud environments such as Dropbox, Skybox, Google and AWS
2. POSIX is not changing, and for long-term archives it needs to change
3. REST and similar interfaces are seeing more active development
Let’s review each of these.
Whether it be a private or public cloud, the drag-and-drop ability to save information that you could not afford to save on your local storage is allowing a great deal more data to be saved.
So now you have a situation where you can save a lot more stuff and you do not really see the charges and/or costs. That means the people who run these types of sites are going to have to tier storage to reduce costs.
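As a rough illustration of what tiering to reduce costs means in practice, here is a minimal Python sketch of an age-based placement policy. The tier names, the 30-day and 365-day thresholds and the `StoredObject` type are assumptions made up for this example, not figures from any vendor or site. Real HSM policies also weigh file size, access patterns and recall cost.

```python
from dataclasses import dataclass

# Hypothetical tiers, ordered fastest and most expensive first. The age
# thresholds are illustrative assumptions, not vendor figures.
TIERS = [
    ("fast_disk", 30),       # accessed within the last 30 days
    ("capacity_disk", 365),  # accessed within the last year
]
COLDEST_TIER = "tape"        # everything older lands here


@dataclass
class StoredObject:
    name: str
    days_since_access: int


def choose_tier(obj):
    """Return the first tier whose age limit covers the object, else tape."""
    for tier, max_age_days in TIERS:
        if obj.days_since_access <= max_age_days:
            return tier
    return COLDEST_TIER


def plan_placement(objects):
    """Group object names by the tier the policy assigns them to."""
    plan = {}
    for obj in objects:
        plan.setdefault(choose_tier(obj), []).append(obj.name)
    return plan


if __name__ == "__main__":
    objects = [
        StoredObject("holiday_photos.zip", 3),
        StoredObject("q1_report.pdf", 200),
        StoredObject("survey_archive.tar", 2000),
    ]
    for tier, names in plan_placement(objects).items():
        print(tier, names)
```

The user who dragged the file in never sees which tier it landed on, which is exactly the point: the site operator carries the cost, so the operator needs the policy.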