NFS and Parallel File Systems: Performance Analysis - Page 2
NFS vs. a parallel file system for metadata
Metadata activity is another big area of difference, and it is often why customers look to parallel file systems. In NFS file systems, metadata performance is often the bottleneck. With an NFS-mounted file system, an RPC getattr request is made to get the file attributes. For a good picture of what happens with NFS metadata, see page two of the following PDF.
As you can see, the RPC has to be done and the information passed back. This is not much different from what happens with parallel file systems. The main difference is that parallel file systems were designed from the start to support billions of files with high-performance metadata access.
This is far different from the design points of most NFS servers. Most of those designs were done for a maximum of hundreds of millions of files, not billions. The underlying NFS protocol also does not support some of the features that are available in parallel file systems, for example, for doing an ls -l (that is, a stat() of every entry) on a directory with 500,000 files. This is not to say that such a listing will complete in one second, even if things are cached, but most parallel file systems support at least 30,000 stat() calls per second from a client.
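To put a number like 30,000 stat() calls per second in context for your own mount, a rough client-side measurement can help. The following is a minimal sketch, not a benchmark suite: the function name `stat_rate` is hypothetical, it is single-threaded, and it assumes the directory already holds a representative number of files. Client attribute caching can inflate the result on repeated runs, so the first run on a cold cache is usually the more honest figure.

```python
import os
import time


def stat_rate(directory):
    """Stat every entry in `directory` once.

    Returns (entry_count, elapsed_seconds, stat_calls_per_second).
    """
    # Build the path list first so the timed loop contains only stat calls.
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]

    start = time.perf_counter()
    for path in paths:
        # lstat() does not follow symlinks, which is what `ls -l` reports.
        os.lstat(path)
    elapsed = time.perf_counter() - start

    rate = len(paths) / elapsed if elapsed > 0 else float("inf")
    return len(paths), elapsed, rate
```

Calling `stat_rate("/path/to/big/directory")` on the same directory over an NFS mount and over a parallel file system mount gives a like-for-like comparison from a single client.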
Open/create performance is another huge difference: parallel file systems offer at least 25,000 open/creates per second, and far less is available via NFS. Unlink/remove shows the same performance ratio compared with NFS systems. The NFS protocol was not designed for the kind of performance that large environments require. Combine that with the fact that, by design, metadata is not always in sync with NFS (you can, of course, tune the client to reduce caching and make things more synchronized, but at a performance price), while parallel file systems keep their metadata synchronized as part of the design. Yes, the inode update of atime (access time) might be a bit out of sync, but other than that, client-cached metadata is not. Also, these file systems are POSIX-compliant, unlike file systems accessed over NFSv3, which is what most of us use.
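The open/create and unlink rates above can be checked the same way from one client. This is a minimal sketch under stated assumptions: `create_unlink_rate` and the `bench_` file prefix are hypothetical names, the test is single-threaded, and it creates empty files only. Real metadata benchmarks run many clients in parallel, so treat the result as a per-client sanity check rather than a file system rating.

```python
import os
import time


def create_unlink_rate(directory, n=1000):
    """Create, then remove, `n` empty files in `directory`.

    Returns (creates_per_second, unlinks_per_second).
    """
    pid = os.getpid()
    paths = [os.path.join(directory, "bench_%d_%d" % (pid, i)) for i in range(n)]

    start = time.perf_counter()
    for path in paths:
        # O_EXCL forces a real create instead of opening an existing file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        os.close(fd)
    create_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for path in paths:
        os.unlink(path)
    unlink_seconds = time.perf_counter() - start

    def per_second(seconds):
        return n / seconds if seconds > 0 else float("inf")

    return per_second(create_seconds), per_second(unlink_seconds)
```

Run it against an empty scratch directory on each mount you want to compare; the function removes every file it creates.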
What needs to be done
The real issue is that what happens on an NFS client and what it translates to on an NFS server are about the same. On the client side, some metadata caching and setting the read and write sizes to the largest values the servers support are about all you can tune to change the behavior. Looking at the server side and tuning from that perspective is what you need to do.
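Before tuning, it helps to see which read/write sizes and attribute-cache settings the client is actually using. The sketch below assumes a Linux client, where `/proc/mounts` lists each mount as device, mount point, file system type, and a comma-separated option string; `rsize`, `wsize`, `actimeo`, and the `acreg*`/`acdir*` options are standard NFS mount options, while the function name `nfs_mount_options` is hypothetical.

```python
def nfs_mount_options(mounts_text):
    """Return {mount_point: {option: value}} for NFS entries.

    `mounts_text` is text in /proc/mounts format (Linux).
    """
    wanted = ("rsize", "wsize", "actimeo",
              "acregmin", "acregmax", "acdirmin", "acdirmax")
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = {}
        for opt in fields[3].split(","):
            key, _, value = opt.partition("=")
            # Keep only the numeric tuning options of interest.
            if key in wanted and value.isdigit():
                options[key] = int(value)
        result[fields[1]] = options
    return result
```

On a Linux client, `nfs_mount_options(open("/proc/mounts").read())` shows the negotiated sizes, which may be smaller than what was requested at mount time if the server does not support the larger value.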
For a parallel file system, what needs to be done is to understand the application I/O request sizes by using strace(1), which traces system calls and signals. Are the requests big or small, aligned or unaligned? Are the applications using system calls or standard I/O (fopen/fread/fwrite)? Last but certainly not least, can the application I/O request size be changed to be larger? Can you modify the application with code or input deck changes? Do different input cases have different I/O request sizes?
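The questions above can be answered from a saved strace log. The following is a minimal sketch that assumes the usual strace text output for `read`, `write`, `pread64`, and `pwrite64`; the function name `summarize_requests` and the 4096-byte alignment unit are assumptions, so substitute the block or stripe size of your file system. Plain `read`/`write` calls carry no offset, so only their size is checked for alignment.

```python
import re
from collections import Counter

# read(fd, buf, count) / write(fd, buf, count): count is the last argument.
RW = re.compile(r"\b(read|write)\((\d+), .*, (\d+)\)\s*= (-?\d+)")
# pread64(fd, buf, count, offset) / pwrite64(...): offset is the last argument.
PRW = re.compile(r"\b(pread64|pwrite64)\((\d+), .*, (\d+), (\d+)\)\s*= (-?\d+)")


def summarize_requests(lines, block=4096):
    """Return (Counter of request sizes, aligned count, unaligned count)."""
    sizes = Counter()
    aligned = unaligned = 0
    for line in lines:
        match = PRW.search(line)
        if match:
            size, offset, ret = (int(match.group(i)) for i in (3, 4, 5))
        else:
            match = RW.search(line)
            if not match:
                continue  # other syscalls, signals, unfinished calls
            size, offset, ret = int(match.group(3)), None, int(match.group(4))
        if ret < 0:
            continue  # failed call, no I/O was performed
        sizes[size] += 1
        if size % block == 0 and (offset is None or offset % block == 0):
            aligned += 1
        else:
            unaligned += 1
    return sizes, aligned, unaligned
```

A log can be captured with something like `strace -f -e trace=read,write,pread64,pwrite64 -o trace.txt ./app` (where `./app` stands in for your application) and passed in as `summarize_requests(open("trace.txt"))`. Because strace sees only system calls, applications using standard I/O show up with the library's buffer sizes rather than the sizes passed to fread/fwrite.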
The keys here are to understand what you can and cannot do, and to understand the range of request sizes for all the applications that make up a majority of your workload. Tuning parallel file systems could be as simple as setting the allocation size to match a majority of your request sizes, or setting a size in the job run script to map the application's allocation onto the various server nodes. Which of these methods works depends on which file system is selected.
When I was a kid, my mom always used to tell me that you cannot put a square peg in a round hole. But I would always respond, "Yes you can, you pound it in." What I learned as an adult is that you can pound it in, but you get lots of splinters.
My point is that you cannot treat performance analysis and workload characterization the same way on NFS as you would on a parallel file system. If you use the same methods and then try to architect a parallel file system solution, you will likely be sorely disappointed in the performance.
I am seeing more and more environments wanting larger and larger namespaces, and I do not see how NFS-based file storage systems are going to scale to meet the requirements. Therefore, I expect to see more usage of parallel file systems than we see today, replacing some of the larger NFS environments.