Download the authoritative guide: Enterprise Data Storage 2018: Optimizing Your Storage Infrastructure
Even 20 years ago applications ran in parallel in novel research projects at companies such as Thinking Machines and Kendall Square Research. Neither of those companies are still in business, nor are almost all of the companies started during the early days of parallel architectures, but parallel applications are used everywhere today and not just for predicting the weather or designing airplanes and cars.
Search engines such as Google, open source Hadoop, databases such as Oracle and DB2, and many others are parallel applications.
Parallel applications can do I/O in two different ways: They can write to a parallel file system such as GPFS, Lustre or Pan-FS, or they can write to a local file system on each server node in the configuration. The problem is that there is a dirty little secret: We have no standards other than the Message Passing Interface (MPI) for parallel processing communications and MPI-IO for parallel I/O. And I do not see any changes in sight because, in my opinion, the vendors that now control the standards bodies do not want to change anything as they can sell proprietary solutions that cost more in the long run, and lock you into that vendor. And it costs much more if you want to change vendors.
Open-source alternatives for search engines, such as Hadoop, have parallel communications, but are doing I/O to a local file system. I do not think this is a good long-term plan for I/O, which is why Im ranting.
Why we should all want parallel I/O
Since the 1980s, when standards for UNIX were agreed upon, weve had standards for the main system calls (open, read, write, close, lseek). These system calls were used to implement libc and we had standards for library calls such as fopen, fread, fwrite, fclose, fseek.https://o1.qnsr.com/log/p.gif?;n=203;c=204650394;s=9477;x=7936;f=201801171506010;u=j;z=TIMESTAMP;a=20392931;e=i
Since the 1990s we have gotten almost nothing in the I/O area for user applications other than the addition in the standards of a few asynchronous I/O system calls. This means that the interface from the application and operating system to the file system is virtually unchanged.
I had very high hopes that object storage, which had a rich interface, would have allowed changes to the application interface. All of these hopes were crushed and the broad market did not deliver products based on the Object-based Storage Device (OSD) standard -- not because it was not a good technical idea, and not because the storage vendors could not have made profits, but because of the required investment and the worldwide economic downturn.
So here we are with no changes 20+ years after the standardization of application I/O interfaces.
The narrow interface to the file system limits what existing parallel file systems can do to improve the performance for application I/O. Of course, many vendors add interface features to their file systems that allow application optimization for those specific file system implementations. But each vendor implements the application interfaces differently, which is understandable because there are no standards for parallel I/O or parallel file systems, and the proposal that was made a few years ago to the OpenGroup was ignored. The parallel system vendors still have to follow the standards requirements for file systems and the associated commands (ls, df, find ..), which causes significant impacts for testing the standards and the performance of I/O.
Why are things the way they are, and what caused us to get to this point? For example, why did Hadoop implement its search engine as a local file system, as did others? One of the reasons is simple: The designers had to do local I/O for performance reasons. Parallel I/O with parallel file systems is slower than local I/O given all of the management requirements to meet the POSIX standards. Even if each application opens a separate file from each thread or task, parallel file systems are slower than local file systems. This is often called N to N (N tasks to N files). Things get much worse, or for some file systems impossible, with N to 1 (N tasks to a single file), which is critical when you have hundreds of thousands, or millions, of tasks opening files.
The other main reason Hadoop does local I/O is the cost of the storage communications network. Having local drives on each node is very cost effective, and given the performance issues with parallel I/O and parallel file systems there is no compelling reason to do anything but local I/O.
So whether the application is a database, a search engine, high performance computing (HPC) or any combination of those applications, parallel I/O is used. In the HPC world, where applications communicate via MPI, it was realized that parallel I/O was needed and the standards body that controls MPI added a standard for parallel I/O called MPI-IO.
Some of the features of MPI-IO allow small writes or reads to be consolidated into larger writes or reads, and allow the high-speed communications network to deal with this consolidation. All of this works if you are using the MPI programing model. It does not work without MPI, which means that this will never work because MPI is used only in HPC application environments and not in databases or search engines.
A Modest Proposal
I believe we have a chicken-and-egg scenario with parallel applications, outside of HPC implementing applications with local I/O, because parallel I/O to parallel file systems is far too slow. We also, equally importantly, have no standard parallel I/O interface that these applications can use and that parallel file system vendors can apply for various levels of optimization.
We need to completely re-think I/O in general and, unfortunately, I believe that the ANSI T10 OSD standard, which provided a reasonably good start at an updated framework for I/O, is basically dead as an industry standard. It just was released at the wrong time and the storage industry has wrongly, in my opinion, ignored the importance of re-thinking I/O. Even if ANSI T10 OSD became an industry standard, we still do not have anyone discussing parallel I/O frameworks for languages. C++, C, Java, databases and other languages are stuck with the standard POSIX interface, with no chance of doing parallel I/O.
Admittedly, most search engines communicate to local file systems not only because there is no parallel I/O standard but also because of the cost of parallel file systems in terms of hardware, but many of these search engines solve the node failure problem by using multiple replicas of each of the nodes. There are costs for the blades, local network, replication network, and power and cooling for this architecture, and I am not sure that these costs are well known and well understood as they compare to parallel file systems, and it does not matter because there is no standard.
So my challenge to the community (standards bodies, hardware vendors, database experts, language developers and others), is to develop a standard for parallel I/O for multiple languages that supports all of the features and functions of MPI-IO and more.
This new parallel I/O standard will also require updates to POSIX to allow application writes to pass information about the I/O to the file system in order to improve performance. This new eco-system will also require debuggers to allow users to debug their parallel I/O mistakes. This means that the OpenGroup, which controls POSIX, will have to make some major changes along with various ANSI committees and, likely, the IETF for NFS.
Everyone will have to work together so we can finally have a well-integrated I/O stack with rich interfaces that are standardized and can meet the demands of a variety of application environments -- from databases to search engines to HPC -- as all of these applications require, or could require, parallel I/O.
When I explained this vision to someone recently they asked if it was time to see a doctor as I could benefit from some anti-delusional medication. It is good to dream, but as I have said in the past, storage might become irrelevant if it does not become easier to use and scale to meet bandwidth requirements. Sadly, I think the chances of this happening are about the same as me winning PowerBall, and I do not buy tickets.
Henry Newman, CEO and CTO of Instrumental, Inc., and a regular Enterprise Storage Forum contributor, is an industry consultant with 29 years experience in high-performance computing and storage.
Follow Enterprise Storage Forum on Twitter.