Cloud storage cannot replace POSIX file systems fully, so various technologies will need to co-exist.
The world has had only one agreed-upon file system standard since about the mid-1980s: POSIX (Portable Operating System Interface).
POSIX came from IEEE Standard 1003.1-1988, released in 1988. And the last change to POSIX file system I/O was to add asynchronous I/O system calls back in 1991. One of the big backers and participants for the development of the POSIX standard was the US government, as they were less than happy with having to port applications to various hardware platforms and operating systems from vendors. The various vendors ranged from Digital Equipment Corporation to IBM to HP, and included various flavors of UNIX from Sun and others.
Right or wrong, the thought was that if you have a common application operating system interface, applications will be portable and the US government would not have to worry about application porting. Of course we all now know how naïve that was, but that was the goal.
With the advent of the Web, a new interface was needed. You cannot do system calls or C library calls over HTTP, so Representational State Transfer (REST) was developed back in 2000 by Roy Fielding as a way to have an interface to Web servers.
In the last 12 years, the REST interface and REST applications have exploded – especially over the last 3 years, with the movement of applications and storage into the cloud.
So what I want to explore this month is the big question: Will REST overtake POSIX as an interface of choice for all applications?
POSIX has been around for a long time and has rich interfaces. You have of course the C library interface with open/close, read/write, and the ability to randomly read or write data within a file with use of fseek.
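As a rough illustration of that random access, here is a minimal Python sketch. Python's buffered file objects play the role of the C library calls (open, seek, read and write standing in for fopen, fseek, fread and fwrite). The helper names and the fixed record size are assumptions made up for the example, not part of any standard.

```python
import os
import tempfile


def read_record(path, index, record_size):
    """Return the fixed-size record at position `index` using a seek."""
    with open(path, "rb") as f:          # buffered stream, comparable to fopen
        f.seek(index * record_size)      # reposition, comparable to fseek
        return f.read(record_size)       # comparable to fread


def overwrite_record(path, index, record):
    """Overwrite one record in place without rewriting the rest of the file."""
    with open(path, "r+b") as f:         # read/write, do not truncate
        f.seek(index * len(record))
        f.write(record)


if __name__ == "__main__":
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    with open(demo_path, "wb") as f:
        f.write(b"AAAABBBBCCCCDDDD")     # four 4-byte records
    overwrite_record(demo_path, 2, b"cccc")
    print(read_record(demo_path, 2, 4))
    os.remove(demo_path)
```

The point is only that the application touches the third record and nothing else; the rest of the file never has to be read or rewritten.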
It should be noted that Java supports the C library style of interface when it is doing file reads and writes, and also supports REST. The system call interface is richer still: direct system calls to the data with no application-level buffering, many more options on the open system call, and support for asynchronous I/O.
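To make the difference concrete, the sketch below goes through Python's os module, which is a thin wrapper over the POSIX system calls on UNIX-like systems. It shows open flags (O_CREAT and O_EXCL) and a positioned read with no application-level buffer in between. The function names are hypothetical, and the code assumes small writes that complete in a single call.

```python
import os
import tempfile


def create_exclusive(path, data):
    """Create `path` only if it does not already exist, then write `data`."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL   # fail if the file exists
    fd = os.open(path, flags, 0o600)
    try:
        os.write(fd, data)   # write system call, no user-space buffer
        os.fsync(fd)         # ask the kernel to flush to stable storage
    finally:
        os.close(fd)


def read_at(path, offset, length):
    """Read `length` bytes at `offset` without moving a file position."""
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "obj.bin")
    create_exclusive(path, b"0123456789")
    print(read_at(path, 4, 3))
```

A second call to create_exclusive on the same path raises FileExistsError, which is the kind of guarantee the open system call gives you and a buffered library call does not.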
POSIX has been around for almost 25 years and has, I would guess, many millions of applications that use the standard, more likely billions. On the other hand, POSIX has not been updated in over 20 years. There have been proposals to update POSIX, but it is controlled by The Open Group, which has significant input from the vendor community, and the vendors do not want to make any changes. Changes cost money, both for development and, more importantly, for the test suites that validate the standard, plus the time to run those test suites against updates to the operating system stack.
The POSIX requirements for things like metadata consistency and multiple threads writing to the same file are burdensome for scaling file systems to billions of files and for scaling applications that might require parallel I/O, such as a database. On the plus side, the POSIX interface allows you to access parts of files, so you can read and write before the whole file arrives, unlike REST.
The Achilles heel for POSIX in my opinion is file system inodes and the requirements for atomicity imposed upon the file system by the standard. The command ls -l and the requirements around it are the enemy of scalability for most POSIX file systems.
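To see why a long listing hurts, consider what it has to do: read the directory, then look up the inode for every single entry. The sketch below is a simplified stand-in for ls -l, not a reimplementation of it, and the function name is made up for the example.

```python
import os
import stat
import tempfile


def long_listing(directory):
    """Return (name, mode string, size, mtime) per entry, roughly like ls -l."""
    rows = []
    for name in sorted(os.listdir(directory)):
        # One inode lookup per entry; this is the part that grows with the
        # number of files in the directory.
        info = os.lstat(os.path.join(directory, name))
        rows.append((name, stat.filemode(info.st_mode), info.st_size, info.st_mtime))
    return rows


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    with open(os.path.join(directory, "a.dat"), "wb") as f:
        f.write(b"12345")
    for row in long_listing(directory):
        print(row)
```

With a few thousand files nobody notices. With billions of files spread across many servers, every one of those lookups has to return a consistent answer, and that is where the pain starts.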
I think the biggest strength of the REST interface is that the backend management of the files or objects is left up to the developers of the management system. The same could be said for POSIX file systems but the number of things imposed upon the developers limits what can be done.
SOAP (Simple Object Access Protocol) is similar to REST, but REST is less strongly typed than SOAP and does not require XML. The REST interface, which uses HTTP, has a modest set of methods for accessing objects. Examples include GET, PUT, POST and DELETE.
REST uses these access methods and other functions and features via the well-defined HTTP protocol. HTTP is used to address proxies and gateways, caching and security enforcement. It also allows application developers to define new, application-specific methods that add to the current well-defined HTTP methods. For example, methods might include:
• createPurchaseOrder(string CustomerID, string PurchaseOrderID)
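To give a feel for how narrow the REST interface is, here is a toy object store built only on the Python standard library. It is a sketch under heavy assumptions: the in-memory dictionary backend, the handler class, the helper names and the paths are all invented for illustration, and nothing here reflects any vendor's product. It supports PUT, GET and DELETE on whole objects and nothing else.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

OBJECTS = {}  # hypothetical in-memory backend: URL path -> object bytes


class ObjectHandler(BaseHTTPRequestHandler):
    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        OBJECTS[self.path] = self.rfile.read(length)  # the whole object at once
        self.send_response(201)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        body = OBJECTS.get(self.path)
        if body is None:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_DELETE(self):
        if OBJECTS.pop(self.path, None) is None:
            self.send_error(404)
            return
        self.send_response(204)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def start_server():
    """Start the toy store on a free local port and return the server."""
    server = HTTPServer(("127.0.0.1", 0), ObjectHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def call(server, method, path, data=None):
    """Issue one HTTP request and return (status code, body)."""
    url = "http://127.0.0.1:%d%s" % (server.server_address[1], path)
    request = urllib.request.Request(url, data=data, method=method)
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(request) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        return err.code, b""


if __name__ == "__main__":
    server = start_server()
    print(call(server, "PUT", "/bucket/a", b"hello"))
    print(call(server, "GET", "/bucket/a"))
    server.shutdown()
```

Notice what is missing: there is no seek, no partial write and no shared file position. How the objects are laid out behind the dictionary is entirely up to whoever builds the backend.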
SOAP, though similar to REST, has some advantages. The biggest advantage of SOAP over REST comes from REST's dependence on HTTP. Since SOAP is not tied to HTTP and HTTP conventions, it works well over raw TCP, named pipes, message queues and other direct connections. It still has the same advantage as REST, in that the interface is not via system calls but via the file object.
We do not have a lot of POSIX file systems that scale today to tens of PBs and billions of files. There are four file systems in production with a parallel namespace (Gluster, PanFS, Lustre, and GPFS) and a new entry called Ceph.
Ceph, GPFS, Lustre and PanFS support parallel I/O, which is I/O from multiple threads (these threads could be running on multiple nodes) to a single file, but Gluster does not. On the other side there are dozens of vendors developing REST- and SOAP-based object management interfaces.
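Parallel I/O in this sense can be sketched on a single machine with a few threads writing disjoint regions of one file. This is only an illustration of the access pattern: it assumes equal-sized blocks, the function name is hypothetical, and it says nothing about how a parallel file system coordinates writers on different nodes.

```python
import os
import tempfile
import threading


def parallel_write(path, blocks):
    """Write each block at its own offset in one shared file, one thread per block."""
    block_size = len(blocks[0])          # assumption: all blocks are the same size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

    def worker(index, block):
        # pwrite takes an explicit offset, so the threads never share
        # (or fight over) a single file position.
        os.pwrite(fd, block, index * block_size)

    threads = [
        threading.Thread(target=worker, args=(i, block))
        for i, block in enumerate(blocks)
    ]
    try:
        for thread in threads:
            thread.start()
        for thread in threads:
            thread.join()
    finally:
        os.close(fd)


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "shared.dat")
    parallel_write(path, [b"AAAA", b"BBBB", b"CCCC", b"DDDD"])
    with open(path, "rb") as f:
        print(f.read())
```

Whatever order the threads run in, the file ends up the same, because each writer owns its own byte range. Doing this across nodes while keeping POSIX semantics is the hard part the parallel file systems take on.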
Vendors are trying to create systems that support billions of objects in a single namespace. Given that the vendors are not constrained by the POSIX atomicity requirements or by the need to support parallel I/O, this is far easier than developing this support inside a POSIX file system.
The main reason this is easier is that the REST and SOAP interfaces are far narrower than POSIX and are not encumbered by a standards process controlled by vendors. With REST and SOAP you can have policies on files for replication to remote locations, policies for access control, policies for encryption, etc. None of these policies has to be stored in POSIX inodes. If a POSIX file system instead kept them in a separate database, a crash could leave the inodes and the database inconsistent, not to mention the time to fsck (check the file system consistency).
On the other hand, if the object is really big, I cannot use POSIX reads and writes to start reading the object until the whole object has been moved with a REST or SOAP interface. This might not matter in most application environments, but it matters for applications that need to process data before the whole file is there, such as oil seismic traces, raw video feeds and others; clearly, not your everyday applications. It is still important to the many communities where files are very large. And don't forget that all databases randomly position into their files.
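Processing a file while it is still arriving looks something like the following sketch. A writer appends to the file and a reader, holding its own descriptor, picks up whatever complete records have landed so far. The 8-byte record size and the function name are assumptions for the example; real seismic or video formats are of course more involved.

```python
import os
import tempfile

RECORD_SIZE = 8  # hypothetical fixed record length


def read_new_records(fd, already_read):
    """Return the complete records that have landed since the last call."""
    size = os.fstat(fd).st_size          # how much of the file exists right now
    complete = size // RECORD_SIZE       # ignore a trailing partial record
    records = []
    for index in range(already_read, complete):
        records.append(os.pread(fd, RECORD_SIZE, index * RECORD_SIZE))
    return records


if __name__ == "__main__":
    directory = tempfile.mkdtemp()
    path = os.path.join(directory, "trace.bin")
    writer = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    reader = os.open(path, os.O_RDONLY)
    os.write(writer, b"rec00001rec0")    # one and a half records so far
    print(read_new_records(reader, 0))
    os.close(writer)
    os.close(reader)
```

With a whole-object PUT and GET there is no equivalent: the reader sees nothing until the transfer is finished.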
I do not see the C programs written for the oil industry to access seismic traces being rewritten, as they need a POSIX file system interface given the performance requirements of parallel I/O. The industry challenge is that the current de facto standard interface for file systems is not meeting requirements for scaling to tens or hundreds of billions of files. There is no movement to change the basic interface, and without a change POSIX file systems are going to be challenged to compete with file systems with REST and SOAP interfaces.
Is it going to become a fight between remote and local data access? POSIX file systems have a real competitor for data access at large scale: cloud applications that do not need the POSIX interface, and all the overhead that goes with it, can use REST or SOAP instead. But those interfaces come without the features, such as asynchronous I/O and random positioning, that go with POSIX and that are required for things like databases.
I personally think that long term we are going to basically have three types of data storage interfaces for both clouds and local data access. The first type will be storage on our local computers, as there is not enough bandwidth on the planet – given the irregular connectivity – for most of us to access remote files quickly. We are therefore going to keep local storage to deal with the issue.
The second type will be POSIX file systems. I think that the shared and parallel POSIX file systems are going to gain more and more market share, with file system clients being distributed across networks of machines. NFS is not meeting the scaling and performance requirements of today’s storage market, so I think parallel file system clients will be a larger part of the hierarchy. Yes, NFS will continue to exist, but as part of a parallel file system hierarchy.
Third will be object access by non-POSIX interfaces, what we know today as REST and SOAP, but might in the future include other methods. Who knows? I also think that you will have storage that has both POSIX and REST/SOAP interfaces from the current parallel POSIX file system vendors that will be part of this hierarchy. Cloud storage cannot replace POSIX file systems fully so we are going to have to coexist.