Data Storage: REST vs. POSIX for Archives and HSM - Page 2
Clouds allow easy consolidation of data from all departments and groups, and people will keep more data than they do now because space is effectively unlimited. Right? Just look at how AWS, Dropbox, Google Drive or SkyDrive are being used, and the cost is continually dropping: dropping faster than the rate of density growth and disk price declines, which have not amounted to much lately.
People often dump lots of data that they never use. Here is a good example. I had lots of videos of my grandmother and when she died a few years ago I reviewed them and put together a few for a family get-together. I added up the space I had in pictures and video of my grandmother and it came to about 5 GiB.
Now that’s not a lot, but suppose I had kept the data in the cloud and found that much of it had gone untouched for 5 or 10 years. It could have been automatically moved to devices that use little power and have a lower cost point, as tape does today.
If I went to get these files and saw a message on the screen saying that the download would begin in about 3 minutes (more than enough time, if the tape drives were busy, to pick and load a tape and read my files), it would not bug me at all, especially if I were paying less for storage or using a free service.
Now even if this were important for my business, most people who had not accessed that type of data for 3 to 5 years could wait 3 to 5 minutes to get it back. In my opinion, given the network latencies and bandwidth available today, you are in fact going to wait to put data into the cloud or get it out, even for big files.
So if the data is not local, the fact is that in the future you are going to wait at least a few minutes for your data to arrive. Let’s put numbers on what I am talking about. Take my 5 GiB of data to download: it would start in 3 minutes or so, and my 35 Mbit/sec home Internet connection gives me, say, 4 MB/sec of download, or 1,280 seconds (a bit over 21 minutes) to get all my data back. In my opinion that is a small price to pay if my data is already in the cloud and I am paying less for secondary storage.
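The back-of-the-envelope arithmetic above can be checked in a few lines of Python (a sketch; the 4 MB/sec figure is the rounded-down estimate from the 35 Mbit/sec link):

```python
# Rough transfer-time estimate for the 5 GiB example above.
data_bytes = 5 * 1024**3        # 5 GiB in bytes
rate_bytes_per_s = 4 * 1024**2  # ~4 MB/s, rounded down from 35 Mbit/s

seconds = data_bytes / rate_bytes_per_s
print(f"{seconds:.0f} s, ~{seconds / 60:.1f} min")  # 1280 s, ~21.3 min
```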
The bottom line here is that most if not all of the cloud implementations use a REST/SOAP interface. What we see in our directories is based on using those interfaces and not POSIX. What is missing is a rich set of interfaces that could automatically give hints for tiering storage, that provide data integrity and that could show an approximate amount of time it will take to get my data back.
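To make the missing pieces concrete, here is a hypothetical sketch of what such an interface could carry. None of these header names exist in any real API that I know of; they only illustrate the kinds of metadata (tiering hints, integrity checksums, retrieval-time estimates) the paragraph above argues for:

```python
# Hypothetical sketch: richer hints a REST storage interface could carry.
# The header names below are invented for illustration, not a real API.
def archive_put_headers(tier_hint, sha256_hex):
    """Build headers for a hypothetical archive PUT request."""
    return {
        "X-Storage-Tier-Hint": tier_hint,   # e.g. "cold" -> candidate for tape
        "X-Content-SHA256": sha256_hex,     # end-to-end data integrity check
    }

def retrieval_estimate_seconds(response_headers):
    """Read a hypothetical 'seconds until data is staged' response header."""
    return int(response_headers.get("X-Estimated-Retrieval-Seconds", "0"))

hdrs = archive_put_headers("cold", "ab" * 32)
print(hdrs["X-Storage-Tier-Hint"])  # cold
print(retrieval_estimate_seconds({"X-Estimated-Retrieval-Seconds": "180"}))  # 180
```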
File system development has not changed in decades, while at the same time there have been lots of technology changes. The vendors that control the standard, for the most part, do not want to change anything, because change costs significant money: developing the features, developing and running tests for the new functionality, and the hardware to test on. And who is going to pay for these changes?
This was, and is, myopic thinking on the part of the vendors controlling POSIX: a good short-term strategy, but not good for the long haul. Today we have no large SAN or local file systems that scale, and the Red Hat supported limit for XFS (the largest supported file system) is still only 100 TB, which is tiny given the file system sizes people really want today.
Today we have a few parallel file systems that scale to tens of PB in a single namespace, mostly delivered as parallel file system appliances, and for the most part that is it. Interfaces to tiered storage with policies are not standardized, and there are no plans to standardize them as far as I can tell. I actually made a proposal to some groups to try and add this, but got nowhere.
At this point, scaling to tens of billions of files in a single namespace is difficult and often expensive. Recovery for POSIX file systems is a long process, given the required consistency and the fact that the file system by default has to arbitrate between two threads trying to write to the same file.
The real issue is that there is no standard way for the file system to put data into and get it out of secondary storage. Twenty years ago there were many tens of POSIX HSM vendors and systems. Today there are fewer than 10 from what I can determine, and most of those are not growing anywhere near as rapidly as the cloud storage market.
If you are under 30 and doing cloud or other development, you are most likely developing to a REST or REST-like interface and not using a POSIX file system for your cloud. This allows you far more flexibility than POSIX does: instead of the back end having to be a file system that deals with the VFS layer and inodes, from what I can tell most REST interfaces put the data in a database that manages all of the files.
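The flexibility difference can be sketched in a few lines. The toy class below stands in for a REST-style back end: a flat key-to-bytes mapping with attached metadata, the kind of thing easily backed by a database table, with no VFS layer or inodes involved. It is an illustration only, not any real object store's API:

```python
# Toy flat-namespace object store (no real HTTP; illustration only).
# A key maps straight to bytes plus metadata, which a back end could
# keep in a database table instead of a POSIX file system.
class ObjectStore:
    def __init__(self):
        self._objects = {}  # key -> (data, metadata)

    def put(self, key, data, **metadata):
        """Store an object under a key, with arbitrary metadata."""
        self._objects[key] = (data, metadata)

    def get(self, key):
        """Return the stored bytes for a key."""
        return self._objects[key][0]

    def metadata(self, key):
        """Return the metadata dict for a key (e.g. a tiering hint)."""
        return self._objects[key][1]

store = ObjectStore()
store.put("videos/grandma-1998.mov", b"...", tier="cold")
print(store.metadata("videos/grandma-1998.mov")["tier"])  # cold
```

The point of the sketch: since keys are just strings and metadata is just rows in a table, the back end is free to add tiering hints or retrieval estimates without touching a file system standard.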