I was recently at a customer site working on a problem they had. This group was a customer of a large NAS vendor (no vendor names used in my articles) and had many PB of NFS-attached storage. They were looking at potentially purchasing a new parallel file system, and part of my job was to help them characterize the workload. The two leading parallel file systems today are GPFS from IBM and Lustre, which is open source and supported by a number of vendors.
Characterizing the workload sounds pretty easy, and of course, it is not.
This group did performance analysis by looking at the statistics from the NFS server. If you are buying an NFS server, that is likely a good approach, but if you are moving up the food chain and looking at purchasing a parallel file system to meet scalability requirements that cannot be achieved with NFS, then think again.
A different approach was going to be needed for some good reasons.
The key difference between NFS and a parallel file system is that what happens in the data path is totally different. So the performance analysis techniques that you might have used with NFS are not the right techniques for a parallel file system.
The NFS path eventually goes over TCP to the NFS server. Yes, you could use UDP, but given its reliability, I am not sure anyone does that any more. The size of each request is governed by two mount parameters, the read size and the write size (rsize and wsize). For NFSv2 or NFSv3, the default value for both parameters is 8192 bytes. For NFSv4, the default value for both parameters is 32768 bytes. You can set these values even larger, but they will be negotiated down to the maximum the server supports. So setting them to, say, 1,048,576 might give you the warm, comforting feeling that you can do 1 MiB I/O requests to your NFS server, but you might be making 16384-byte requests because that is all the server supports.
In some ways, this is not any different than a parallel file system, as the client might only be allowed to do I/O requests in the allocation size of the parallel file system. The big difference is that allocation sizes for parallel file systems are generally bigger than what is supported and negotiated by the NFS server.
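One way to see what was actually negotiated, rather than what was requested, is to read the effective mount options back from the client. The following is a minimal sketch, assuming a Linux client where `/proc/mounts` lists the effective `rsize` and `wsize` for each NFS mount. The function name `nfs_transfer_sizes` and the sample mount line are made up for illustration.

```python
def nfs_transfer_sizes(mounts_text):
    """Return {mount_point: (rsize, wsize)} for NFS entries.

    Expects text in the /proc/mounts layout:
    device mount_point fs_type options dump pass
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        # Options look like "rw,vers=4.2,rsize=32768,wsize=32768,...".
        opts = dict(o.split("=", 1) for o in fields[3].split(",") if "=" in o)
        if "rsize" in opts and "wsize" in opts:
            result[fields[1]] = (int(opts["rsize"]), int(opts["wsize"]))
    return result


# Hypothetical sample line, not taken from a real system.
SAMPLE = (
    "server:/export /mnt/data nfs4 "
    "rw,relatime,vers=4.2,rsize=32768,wsize=32768,hard,proto=tcp 0 0\n"
    "/dev/sda1 / ext4 rw,relatime 0 0\n"
)

for mount_point, (rsize, wsize) in nfs_transfer_sizes(SAMPLE).items():
    print(f"{mount_point}: rsize={rsize} wsize={wsize}")
```

On a live client, the same function could be fed the contents of `/proc/mounts` instead of the sample string. If the sizes reported are smaller than what was put on the mount command line, the server negotiated them down.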
There are a few key issues:
- Request sizes
- NFS vs. a parallel file system for metadata
- What needs to be done on the NFS client to understand what will happen on a parallel file system
As I have described, request sizes from the client to the NFS file system can be very different than what might be seen on a parallel file system. Big requests are important for disk drives to operate efficiently.
Here is an example of what happens if each I/O of the size on the left is followed by a random seek and rotational latency, followed by another I/O of the same size. The columns on the right show the disk drive efficiency. I could only get this for Seagate drives as other vendors do not publish detailed information, and I have only shown you two drive types as that is all that will fit.
Clearly the I/O sizes for NFS, even the default NFSv4 sizes, are horrible, with less than 4 percent utilization even on 2.5-inch 15K RPM drives. Of course, the NFS server coalesces I/O into larger requests, but that takes work on the server side. Getting I/O from the same file together on disk requires lots of cache, and therefore expense, if you have many clients.
For a parallel file system, I/O requests made to the file system servers are generally the size of the I/O request from the application. For the two largest file systems in terms of market share, that can be over 1 MiB, and in one case up to 16 MiB.
The bottom line is that a parallel file system will allow larger requests than NFS if the application already makes large requests or can be changed to make them.
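The efficiency arithmetic described above can be sketched in a few lines: the fraction of time a drive spends transferring data when every request pays an average seek plus half a rotation first. The drive parameters below (seek time, RPM, sustained transfer rate) are illustrative assumptions, not vendor specifications, so the percentages it prints will not match any particular drive.

```python
def drive_efficiency(io_bytes, seek_ms, rpm, transfer_mb_per_s):
    """Fraction of time spent moving data for one random I/O.

    efficiency = transfer / (seek + rotational latency + transfer)
    """
    # On average the platter must turn half a revolution to reach the data.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    # Time to move io_bytes at the sustained rate (MB = 10**6 bytes).
    transfer_ms = io_bytes / (transfer_mb_per_s * 1_000_000.0) * 1000.0
    return transfer_ms / (seek_ms + rotational_latency_ms + transfer_ms)


# Hypothetical drive: values chosen only to illustrate the trend.
ASSUMED_SEEK_MS = 3.0
ASSUMED_RPM = 15_000
ASSUMED_MB_PER_S = 150.0

for size in (8192, 32768, 1 << 20, 16 << 20):
    eff = drive_efficiency(size, ASSUMED_SEEK_MS, ASSUMED_RPM, ASSUMED_MB_PER_S)
    print(f"{size:>10} bytes: {eff * 100:5.1f}% efficient")
```

Whatever parameters are plugged in, the shape is the same: the fixed seek and latency cost dominates small requests, and efficiency only climbs once the transfer time is comparable to that fixed cost.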
NFS vs. a parallel file system for metadata
Metadata activity is another big area of difference and is often why some customers look to parallel file systems. In NFS file systems, metadata performance is often the bottleneck. With an NFS-mounted file system, an RPC getattr request is made to get the file attributes. For a good picture of what happens with NFS metadata see page two of the following PDF.
As you can see, the RPC has to be done and the information passed back. This is not much different than what happens with parallel file systems. The main difference is that parallel file systems were designed to support billions of files with high performance metadata access as part of the original design.
This is far different than most NFS servers and their design points. Most of these designs were done for a maximum of hundreds of millions of files, not billions. The underlying NFS protocol does not support some of the features that are available in parallel file systems, for example, to do an ls -l (i.e., a stat() of every entry) on a directory with 500,000 files. This is not to say that this is going to be done in one second even if things are cached, but most parallel file systems support at least 30,000 stat() calls per second from a client.
Open/create performance is another huge difference, with at least 25,000 open/creates per second available on parallel file systems and far less available via NFS. Unlink/remove has the same ratio of performance compared to NFS systems. The NFS protocol was not designed for the kind of performance that is required by large environments. Combine that with the fact that, by design, metadata is not always in sync with NFS (you can, of course, tune to reduce the client caching and make things more synchronized, but at a performance price), while parallel file systems keep their metadata in synchronization as part of the design. Yes, the inode update of atime (access time) might be out of sync a bit, but other than that, metadata cached on the clients is not. Also, these file systems are POSIX-compliant, unlike file systems over NFSv3, which is what most of us use.
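To put numbers like these next to your own environment, a rough single-client measurement of create, stat() and unlink rates is easy to script. The following is a minimal sketch, not a real benchmark: it is single-threaded, Python adds its own overhead, and the stat() pass runs right after the creates, so it will largely reflect client-side caching. The function name `metadata_rates` is an assumption for illustration.

```python
import os
import tempfile
import time


def metadata_rates(directory, n_files=1000):
    """Create, stat, then unlink n_files empty files; return ops/second."""
    paths = [os.path.join(directory, f"f{i:07d}") for i in range(n_files)]

    def timed(op):
        # Apply op to every path and convert elapsed time to a rate.
        start = time.perf_counter()
        for path in paths:
            op(path)
        elapsed = max(time.perf_counter() - start, 1e-9)
        return n_files / elapsed

    def create(path):
        # O_EXCL makes this a true create rather than an open of an existing file.
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o644)
        os.close(fd)

    # Dict entries are evaluated in order: create, then stat, then unlink.
    return {
        "creates_per_s": timed(create),
        "stats_per_s": timed(os.stat),
        "unlinks_per_s": timed(os.unlink),
    }


if __name__ == "__main__":
    # Uses a local scratch directory by default; pass dir=<a path on the
    # file system under test> to TemporaryDirectory to measure that mount.
    with tempfile.TemporaryDirectory() as scratch:
        for name, rate in metadata_rates(scratch, 1000).items():
            print(f"{name}: {rate:,.0f}")
```

Running the same script against an NFS mount and a parallel file system mount from the same client gives a like-for-like, if crude, comparison of the three operations discussed above.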
What needs to be done
The real issue is that what happens on an NFS client and what this translates to on an NFS server are about the same. Tuning the client, other than some metadata caching and setting the read and write size to the largest values the server supports, does little to change the behavior from the client side. Looking at the server side and tuning from that perspective is what you need to do.
For a parallel file system, what needs to be done is to understand the application I/O request sizes by using strace(1), which traces system calls and signals. Are the requests big or small, aligned or unaligned? Is the application using system calls or standard I/O (fopen/fread/fwrite)? Last but certainly not least, can the application I/O request size be changed to be larger? Can you modify the application with code or input deck changes? Do different input cases have different I/O request sizes?
The keys here are to understand what you can do and what you cannot do, and to understand the range of request sizes for all the applications that make up a majority of your workload. Tuning a parallel file system could be as simple as setting the allocation size to match the majority of your request sizes, or setting a size in the job run script that maps the application's allocation across the various server nodes. Which of these methods works depends on which file system is selected.
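The first of those questions, how big the requests are, can be answered by post-processing a trace. The following is a minimal sketch that pulls the requested byte count out of read, write, pread64 and pwrite64 lines and buckets them by power of two. It assumes plain strace output without timestamp or file-descriptor decoding options, optionally prefixed by a PID as produced when following child processes. The names `request_sizes` and `size_histogram` are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g.  read(3, "abc"..., 65536) = 65536
# with an optional "[pid 1234] " or "1234 " prefix.
CALL_RE = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?"
    r"(?P<call>read|write|pread64|pwrite64)"
    r"\(\d+, (?P<rest>.*)\)\s+= -?\d+"
)


def request_sizes(lines):
    """Return {'read': [...], 'write': [...]} of requested byte counts."""
    sizes = {"read": [], "write": []}
    for line in lines:
        m = CALL_RE.match(line)
        if m is None:
            continue  # other syscalls, unfinished/resumed lines, signals
        call = m.group("call")
        # The count is the last argument for read/write and the
        # second-to-last (before the offset) for pread64/pwrite64.
        parts = m.group("rest").rsplit(", ", 2)
        try:
            count = int(parts[-2] if call.endswith("64") else parts[-1])
        except (ValueError, IndexError):
            continue
        sizes["read" if "read" in call else "write"].append(count)
    return sizes


def size_histogram(sizes):
    """Bucket sizes by the largest power of two not exceeding each size."""
    return Counter(1 << (s.bit_length() - 1) if s > 0 else 0 for s in sizes)


# Hypothetical trace lines for illustration.
SAMPLE = [
    'read(3, "abc"..., 65536) = 65536',
    'write(4, "x, y"..., 1048576) = 1048576',
    '[pid  1234] pread64(5, "..."..., 4096, 8192) = 4096',
    'openat(AT_FDCWD, "/etc/passwd", O_RDONLY) = 3',
]
found = request_sizes(SAMPLE)
for kind in ("read", "write"):
    for bucket, n in sorted(size_histogram(found[kind]).items()):
        print(f"{kind:5} >= {bucket:>8} bytes: {n}")
```

A trace for this could be collected with something like `strace -f -e trace=read,write,pread64,pwrite64 -o trace.txt ./app`, where `./app` stands in for the application. Standard I/O calls do not appear by name in such a trace; they show up as read and write system calls of whatever size the library buffers in.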
When I was a kid, my mom always used to tell me that you cannot put a square peg in a round hole. But I would always respond, "Yes you can, you pound it in." What I learned as an adult is that you can pound it in, but you get lots of splinters.
My point is that you cannot treat performance analysis and workload characterization the same way on NFS as you would on a parallel file system. If you use the same methods and then try to architect a parallel file system solution, you will likely be sorely disappointed in the performance.
I am seeing more and more environments wanting larger and larger namespaces, and I do not see how NFS-based file storage systems are going to scale to meet the requirements. Therefore, I expect to see more usage of parallel file systems than we see today, replacing some of the larger NFS environments.