HPC file systems offer many potential benefits, but the technology needs some important features before it will be ready for widespread enterprise adoption.
I recently saw the slides that Fujitsu is using for the Hot Chips conference and noted that Fujitsu is using the Lustre file system in its planned exascale project which will be competing with U.S. exascale plans. A number of fairly large companies, actually many leaders in our industry, are working on large or parallel file systems. Some of these industry storage leaders that are working on parallel file systems solutions include Intel, EMC, Seagate, Hitachi, NetApp all with the Lustre file system and IBM with their GPFS file system.
So what do these high-performance computing (HPC) file systems have to do with you and why should you care?
The industry is rapidly moving to REST interfaces, but there are still some limitations on using REST. However, because REST protocol is not as rich as the POSIX framework, some current applications that have to rewrite data with system calls for writing data that has already been written.
Also, though it is really an application-specific file system, many people are using HDFS. The problem is, in my opinion, that HDFS is very limited in what can do beyond supporting MapReduce. HDFS is good a large block I/O, but other than that, it is pretty limited.
In addition, we have at least three decades of older applications that will need to be rewritten for new file system interfaces, such as those with a REST interface. Making the transition and porting code is going to take many billions of dollars and is not going to happen overnight.
Many current NFS- and CIFS-based NAS systems lack scaling for both data and metadata. Some might support a petabyte or two or even 8 or maybe 100 million files, but what currently supported NFS- or CIFS-based NAS system supports more than 1 billion files, 1 TB/sec of sustained bandwidth and 50 PB of storage space? There are a few that might be able to do the 50 PB or and might even do the 1 billion+ files, but not a single one can meet the bandwidth numbers. And bandwidth performance is important, as is having all of the files in a single namespace.
In order to use the system efficiently, scaling bandwidth performance with the number of PB is important. There are NAS systems that might be able to look up thousands of clients via NFS, but do they operate as efficiently with tens or hundreds of clients as they do with thousands? NFS and CIFS are limiting factors given how the protocols work and how much CPU and overhead they require.
Parallel file systems do not use NFS or CIFS but instead have native optimized client interfaces that allow them to scale the performance and efficiently use more than ten thousand disk drives, getting a high percentage of bandwidth from each drive. Supporting 10, 20 or 30 thousand drives is great, but you also need a file system and, of course,the underlying hardware need to scale nearly linearly with drive, controller and client count.
If you think you can solve the performance issue by using flash rather than disk drives then think again because file system scalability problems cannot be completely solved with flash.
File system developers need to address allocation issues, metadata consistency issues and data streaming issues. The streaming I/O issues can be solved with faster storage, but locking metadata and allocations are a function of design. They might be sped up somewhat with improved hardware, but hardware cannot solve the underlying problems.
Namespace management is not fun for administrators and is costly in terms of the overhead of managing lots of file systems. Supporting more than a billion files is no easy task, given how most if not all NAS file systems are designed. Additionally, NFS and CIFS were not designed to efficiently support say 50,000 open/create system calls for new files nor say 200,000 stat() system calls per second, all at the same time.
HPC file systems are designed with consistency in mind. You have had the experience writing data from one NFS or CIFS mount and trying to read from another. Because of the performance limitations of NFS and CIFS, most systems administrators use caching to improve the performance, which is fine if only one client is accessing a file or for read only. If you are doing reads and writes to the same file, then this becomes a problem. HPC file systems solve the problem; NFS and CIFS do not. With REST it is unclear to me what happens in the protocol if there are reads and rewrites to the same file.
Though HPC file systems might have some or many of these features, today's high-end NAS boxes today have some features today's HPC systems do not:
HPC file systems and the scalability they provide for both data streaming IOPs and metadata are a good model for the requirements for new storage technologies. These file systems can do hundreds of thousands of metadata operations per second and stream many TBs of I/O per second.
On the other hand, enterprise NAS systems supports all kinds of applications, can dedup data and are generally closer to meeting enterprise requirements than most HPC file systems, given their feature set and the original design goals. Clearly, given the amount of code and years of investment in applications, we are not going to be able to flip the switch and presto chango have everyone run on REST systems. It is likely going to take the rest of the decade and then some to be able to switch over, but in the mean time we have a scaling issue.
If the HPC file system community decides to invest in some of the enterprise features listed above, there is a good size market for these file systems. But these file systems will be challenged by the enterprise requirements listed above. Deciding which requirements to tackle first is going to be a challenge as different markets likely have higher priority on different requirements.
I suspect the HPC file systems vendors are going to be looking for new markets soon given their extreme scalability.
Photo courtesy of Shutterstock.