I recently saw the slides that Fujitsu is using for the Hot Chips conference and noted that Fujitsu is using the Lustre file system in its planned exascale project, which will compete with U.S. exascale plans. A number of fairly large companies, many of them leaders in our industry, are working on large or parallel file systems. These storage leaders include Intel, EMC, Seagate, Hitachi and NetApp, all with the Lustre file system, and IBM with its GPFS file system.
So what do these high-performance computing (HPC) file systems have to do with you and why should you care?
The Problem with REST and HDFS
The industry is rapidly moving to REST interfaces, but REST still has some limitations. Because the REST protocol is not as rich as the POSIX interface, some current applications will have to be rewritten, in particular those that use system calls to rewrite data that has already been written.
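To make the rewrite limitation concrete, here is a minimal sketch. It assumes a REST-style store that offers only whole-object GET and PUT; `ToyObjectStore`, `posix_rewrite` and `rest_rewrite` are hypothetical names invented for this illustration, and real object stores may offer range or multipart operations that this toy model ignores.

```python
import os
import tempfile


def posix_rewrite(path: str, offset: int, patch: bytes) -> int:
    """Overwrite bytes in place with pwrite(); returns bytes sent to storage."""
    fd = os.open(path, os.O_WRONLY)
    try:
        return os.pwrite(fd, patch, offset)
    finally:
        os.close(fd)


class ToyObjectStore:
    """Hypothetical stand-in for a REST store: whole-object GET and PUT only."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, body: bytes) -> int:
        self._objects[key] = bytes(body)
        return len(body)  # the whole body travels on every PUT

    def get(self, key: str) -> bytes:
        return self._objects[key]


def rest_rewrite(store: ToyObjectStore, key: str, offset: int, patch: bytes) -> int:
    """Emulate an in-place rewrite: fetch, patch in memory, re-upload everything."""
    body = bytearray(store.get(key))
    body[offset:offset + len(patch)] = patch
    return store.put(key, bytes(body))


if __name__ == "__main__":
    original = bytes(1024 * 1024)  # 1 MiB of zeros
    store = ToyObjectStore()
    store.put("obj", original)
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(original)
        path = f.name
    try:
        print("POSIX bytes written:", posix_rewrite(path, 100, b"ABCD"))
        print("REST bytes uploaded:", rest_rewrite(store, "obj", 100, b"ABCD"))
    finally:
        os.remove(path)
```

In this toy model a 4-byte change costs 4 bytes through the POSIX call but a full 1 MiB upload through the whole-object interface, which is the kind of mismatch an application that rewrites data in place would have to be redesigned around.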
Also, though it is really an application-specific file system, many people are using HDFS. The problem is, in my opinion, that HDFS is very limited in what it can do beyond supporting MapReduce. HDFS is good at large block I/O, but other than that, it is pretty limited.
In addition, we have at least three decades of older applications that will need to be rewritten for new file system interfaces, such as those with a REST interface. Making the transition and porting code is going to take many billions of dollars and is not going to happen overnight.
Why HPC File Systems and What Is Missing from Network Storage
Many current NFS- and CIFS-based NAS systems lack scaling for both data and metadata. Some might support a petabyte or two, or even 8, or maybe 100 million files, but what currently supported NFS- or CIFS-based NAS system supports more than 1 billion files, 1 TB/sec of sustained bandwidth and 50 PB of storage space? There are a few that might be able to do the 50 PB and might even do the 1 billion+ files, but not a single one can meet the bandwidth numbers. And bandwidth performance is important, as is having all of the files in a single namespace.
Scale Up Performance
In order to use the system efficiently, scaling bandwidth performance with the number of PB is important. There are NAS systems that might be able to hook up thousands of clients via NFS, but do they operate as efficiently with tens or hundreds of clients as they do with thousands? NFS and CIFS are limiting factors given how the protocols work and how much CPU and overhead they require.
Parallel file systems do not use NFS or CIFS but instead have native optimized client interfaces that allow them to scale performance and efficiently use more than ten thousand disk drives, getting a high percentage of bandwidth from each drive. Supporting 10, 20 or 30 thousand drives is great, but you also need a file system and, of course, underlying hardware that scale nearly linearly with drive, controller and client count.
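A quick back-of-the-envelope calculation shows why per-drive efficiency matters at these drive counts. The sketch below uses decimal units (1 TB = 1,000,000 MB); the 150 MB/s per-drive streaming rate and the 80 percent efficiency figure are assumptions chosen only for illustration, not measurements of any product.

```python
def required_per_drive_mb_s(aggregate_tb_s: float, drives: int) -> float:
    """MB/s each drive must deliver to reach the aggregate target."""
    return aggregate_tb_s * 1_000_000 / drives


def achievable_tb_s(drives: int, per_drive_mb_s: float, efficiency: float) -> float:
    """Aggregate TB/s if every drive delivers `efficiency` of its streaming rate."""
    return drives * per_drive_mb_s * efficiency / 1_000_000


if __name__ == "__main__":
    for drives in (10_000, 20_000, 30_000):
        need = required_per_drive_mb_s(1.0, drives)
        print(f"{drives} drives: each must sustain {need:.1f} MB/s for 1 TB/sec")

    # Assumed numbers: 150 MB/s per drive, 80% of it actually delivered.
    print(f"{achievable_tb_s(10_000, 150.0, 0.8):.2f} TB/sec from 10,000 drives")
```

With 10,000 drives, each one has to sustain 100 MB/s to reach 1 TB/sec, so a file system that extracts only a small fraction of each drive's streaming rate needs several times the hardware to hit the same number.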
If you think you can solve the performance issue by using flash rather than disk drives then think again because file system scalability problems cannot be completely solved with flash.
File system developers need to address allocation issues, metadata consistency issues and data streaming issues. The streaming I/O issues can be solved with faster storage, but locking, metadata and allocation are functions of design. They might be sped up somewhat with improved hardware, but hardware cannot solve the underlying problems.
Namespace management is not fun for administrators and is costly in terms of the overhead of managing lots of file systems. Supporting more than a billion files is no easy task, given how most if not all NAS file systems are designed. Additionally, NFS and CIFS were not designed to efficiently support, say, 50,000 open/create system calls per second for new files and, say, 200,000 stat() system calls per second, all at the same time.
HPC file systems are designed with consistency in mind. You may have had the experience of writing data from one NFS or CIFS mount and trying to read it from another. Because of the performance limitations of NFS and CIFS, most systems administrators use caching to improve performance, which is fine if only one client is accessing a file or if access is read-only. If you are doing reads and writes to the same file from different clients, then this becomes a problem. HPC file systems solve the problem; NFS and CIFS do not. With REST it is unclear to me what happens in the protocol if there are reads and rewrites to the same file.
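The stale-read problem can be shown with a toy model. This is only a sketch of the general idea: the `Server` and `CachingClient` classes and the `honor_revokes` flag are hypothetical, and they are not how NFS, CIFS, Lustre or GPFS actually implement caching or locking. One client type keeps serving its cached copy after another client writes; the other drops its cached copy when the server revokes it.

```python
class Server:
    """Holds the authoritative copy of each file and notifies clients on writes."""

    def __init__(self) -> None:
        self.files: dict[str, bytes] = {}
        self.clients: list["CachingClient"] = []

    def read(self, name: str) -> bytes:
        return self.files[name]

    def write(self, name: str, data: bytes, writer: "CachingClient") -> None:
        self.files[name] = data
        for client in self.clients:
            if client is not writer:
                client.on_revoke(name)


class CachingClient:
    """Client with a local cache; honor_revokes=False models uncoordinated caching."""

    def __init__(self, server: Server, honor_revokes: bool) -> None:
        self.server = server
        self.honor_revokes = honor_revokes
        self.cache: dict[str, bytes] = {}
        server.clients.append(self)

    def on_revoke(self, name: str) -> None:
        if self.honor_revokes:
            self.cache.pop(name, None)  # drop the stale copy

    def read(self, name: str) -> bytes:
        if name not in self.cache:
            self.cache[name] = self.server.read(name)
        return self.cache[name]

    def write(self, name: str, data: bytes) -> None:
        self.cache[name] = data
        self.server.write(name, data, self)


def second_read_after_remote_write(honor_revokes: bool) -> bytes:
    """Reader caches v1, another client writes v2, reader reads again."""
    server = Server()
    reader = CachingClient(server, honor_revokes)
    writer = CachingClient(server, honor_revokes)
    writer.write("results.dat", b"v1")
    reader.read("results.dat")
    writer.write("results.dat", b"v2")
    return reader.read("results.dat")


if __name__ == "__main__":
    print("uncoordinated cache:", second_read_after_remote_write(False))
    print("cache with revokes: ", second_read_after_remote_write(True))
```

The uncoordinated reader returns the old `v1` contents while the coordinated one returns `v2`. The coordination is what costs design effort, since every write now involves telling other clients about it.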
What's Missing from HPC
Though HPC file systems might have some or many of these features, today's high-end NAS boxes have some features that HPC file systems do not:
- Application support: Any file system needs support from everything from databases to VMware to cloud applications. Just having support for HPC applications reading input data or writing checkpoints is not going to cut it.
- Replication: File systems also need the ability to replicate a file or block of data, including the metadata, such that policies for having data offsite are met.
- Data deduplication and compression: Because everything goes through a single entry point with NAS boxes, doing data deduplication is pretty easy. With parallel file systems, this becomes a huge issue as the data is spread out across many different targets. If different clients write the same data, they might start writing on different targets, so the same data will not start on the same target and the duplicates will not be found. Spreading out the load is the inherent advantage of HPC file systems, but here it is also the downside. Compression is possible on either the client or the file system target.
- Failover: It is far easier to fail over NAS systems and REST-based systems than HPC file systems, given the complexity of the data paths. You have tens or even hundreds of storage targets, as well as metadata, to deal with. Combine that with the number of requests in flight for reads, writes and metadata operations, compare that with NAS or REST systems, and it is obvious why failover is significantly more difficult.
- Resiliency: The difficulty of doing failover and data deduplication is much of the reason resiliency is difficult for HPC file systems.
- Tiering: The NAS and REST worlds are ahead of most HPC file systems for moving data between tiers. Of course, the system that does this best, and actually does most of this list best, is the venerable IBM mainframe running MVS.
- Monitoring: Monitoring of performance and system health is lacking in most HPC file systems and, from what I have seen, in many REST systems, compared to what I have seen in the NAS world.
- Management: System management is not as easy as it seems, as you have to manage the whole system, including the storage, the file system and the network. Given the degree of difficulty, it is not something that you do overnight. The NAS vendors have a head start of many years.
- Hot everything: The time and effort the NAS vendors have put into easy upgrades make this a clear win for them. Software-only REST vendors and poorly integrated HPC file system vendors have a long way to go.
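The deduplication point in the list above can be illustrated with a small sketch. It assumes fixed-size chunks, round-robin striping and deduplication done independently inside each storage target; the function names and the 4-byte chunk size are made up for the example and do not describe any particular file system.

```python
import hashlib


def stripe(data: bytes, chunk_size: int, num_targets: int, first_target: int):
    """Round-robin chunks across targets, beginning at first_target."""
    placement = []
    for i in range(0, len(data), chunk_size):
        target = (first_target + i // chunk_size) % num_targets
        placement.append((target, data[i:i + chunk_size]))
    return placement


def chunks_stored_per_target_dedup(writes, chunk_size: int, num_targets: int) -> int:
    """Each target only recognizes duplicates among the chunks it holds itself."""
    seen = [set() for _ in range(num_targets)]
    for data, first_target in writes:
        for target, chunk in stripe(data, chunk_size, num_targets, first_target):
            seen[target].add(hashlib.sha256(chunk).digest())
    return sum(len(s) for s in seen)


def chunks_stored_single_entry_dedup(writes, chunk_size: int) -> int:
    """A single entry point sees every chunk, so duplicates are found globally."""
    seen = set()
    for data, _first_target in writes:
        for i in range(0, len(data), chunk_size):
            seen.add(hashlib.sha256(data[i:i + chunk_size]).digest())
    return len(seen)


if __name__ == "__main__":
    data = b"AAAABBBBCCCCDDDD"           # four distinct 4-byte chunks
    writes = [(data, 0), (data, 1)]      # same data, different starting target
    print("per-target dedup:  ", chunks_stored_per_target_dedup(writes, 4, 4))
    print("single entry point:", chunks_stored_single_entry_dedup(writes, 4))
```

Two clients writing identical data from different starting targets leave eight chunks on disk under per-target deduplication, while a single entry point stores four. If both writes happened to start on the same target, the per-target count would also drop to four, which is why the outcome depends on where each client begins striping.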
What Does the Future Hold for File Systems?
HPC file systems and the scalability they provide for both data streaming IOPs and metadata are a good model for the requirements for new storage technologies. These file systems can do hundreds of thousands of metadata operations per second and stream many TBs of I/O per second.
On the other hand, enterprise NAS systems support all kinds of applications, can dedup data and are generally closer to meeting enterprise requirements than most HPC file systems, given their feature set and their original design goals. Clearly, given the amount of code and years of investment in applications, we are not going to be able to flip the switch and presto chango have everyone run on REST systems. It is likely going to take the rest of the decade and then some to be able to switch over, but in the meantime we have a scaling issue.
If the HPC file system community decides to invest in some of the enterprise features listed above, there is a good-sized market for these file systems. But these file systems will be challenged by the enterprise requirements listed above. Deciding which requirements to tackle first is going to be a challenge, as different markets likely place a higher priority on different requirements.
I suspect the HPC file systems vendors are going to be looking for new markets soon given their extreme scalability.