To determine whether these file rates are in the ballpark, we contacted Red Hat for comment. Dave Chinner, one of the lead developers for xfs, had a few comments about the test results. When asked whether the general results were plausible, he said the following.
... the numbers are entirely possible. 600,000 inodes/s only requires about 1GB/s of IO throughput to achieve, and the DDN you tested on it is more than capable of this ...
Xfs_repair does extensive readahead itself, and some of the methods it uses are very effective on large RAID arrays so I would expect it to be faster than e2fsck for large scale file systems ...
Then we asked him what he thought about the results when comparing xfs to ext4 in terms of fsck performance. In our results, xfs_repair was about 2-8x faster than e2fsck on ext4, while in some of the previously mentioned talks from Ric Wheeler, xfs_repair was anywhere from 9x to 40x faster. Dave had this to say:
The difference in speed with xfs_repair depends on the density and distribution of the inodes and directory metadata. When you have zero length files, metadata is very dense, and xfs_repair will tend to do very large IOs and run hardware bandwidth (not iops) speed and be CPU bound processing all the incoming metadata.
The basic optimization premise is that if the metadata is dense enough, we do large IOs reading both data and metadata, and then chop them up in memory into metadata buffers for checking, throwing away the data. This is based on the observation that it takes less IO and CPU time to do a 2MB IO and chop it up in memory than it does to seek 50 times to read 50x4k blocks in that 2MB window.
For less dense distributions (like with your larger files), the amount of IO per inode or directory block increases, and therefore the speedup from those optimizations is not as great. In most aged file systems, however, the metadata distribution is quite dense (it naturally gets separated from the data) and so in general those optimizations result in a good speedup compared to reading metadata blocks individually.
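The trade-off Chinner describes (one 2MB read versus 50 separate 4k reads within the same window) can be illustrated with a rough cost model. This is only a sketch: the 5 ms per-IO positioning cost and the 1 GB/s transfer rate are assumed values chosen for illustration, not measurements from the tests, and `read_time` is a hypothetical helper.

```python
# Rough cost model: each IO pays a fixed positioning cost plus transfer time.
# SEEK_S and BANDWIDTH_BPS are illustrative assumptions, not measured values.
SEEK_S = 0.005          # assumed positioning cost per IO, in seconds
BANDWIDTH_BPS = 1e9     # assumed sequential transfer rate, in bytes/second


def read_time(num_ios, bytes_per_io, seek_s=SEEK_S, bandwidth_bps=BANDWIDTH_BPS):
    """Estimated seconds to issue num_ios reads of bytes_per_io each."""
    return num_ios * (seek_s + bytes_per_io / bandwidth_bps)


# One large 2MB read that is chopped up in memory afterwards.
one_large = read_time(1, 2 * 1024 * 1024)

# Fifty individual 4k reads, each paying its own positioning cost.
many_small = read_time(50, 4 * 1024)

print(f"one 2MB IO:   {one_large * 1000:.2f} ms")
print(f"50 x 4k IOs:  {many_small * 1000:.2f} ms")
print(f"ratio:        {many_small / one_large:.1f}x")
```

Under these assumptions the positioning cost dominates the small reads, which is why trading extra bandwidth for fewer IOs pays off when metadata is dense. With sparser metadata, more large reads are needed per useful block and the advantage shrinks.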
When asked if the file system times and file process rate looked good, Chinner responded:
Yes, they are in the ballpark of what I'd expect. The latest version of xfs_repair also has some more memory usage reductions and optimizations that might also help improve large file system repair performance.
When asked about estimates of performance, David stated that with the hardware from DDN that was used, we could reach about 600,000 inodes/s or about 1 GB/s. He explained the 1 GB/s estimate.
It was a rough measurement based on typical inode densities I've seen fs_mark-like workloads produce.
By default, inodes are 256 bytes in size, packed into contiguous chunks of 64 inodes. So, it takes a 16k IO to read a single chunk of 64 inodes. If we have to read 10,000 inode chunks (640,000 inodes), it should only require reading 160MB of metadata. So the absolute minimum bandwidth from storage to read 600,000 inodes/s from disk is around 160MB/s.
But inodes typically aren't that densely populated because there will often be directory and data blocks between inode chunks. So, if we have a 50 percent inode chunk density, xfs_repair will do large reads and discard the 50 percent of the space it reads (i.e., stuff that isn't metadata). Now we are at 320MB/s.
If we have typical small file inode densities, we'll be discarding about 85 percent of what we read in. So at a 1GB/s raw data read rate, we'd be pulling in roughly 150MB/s of inodes, or roughly 600,000 inodes/second ...
Now if we were reading those inodes in separate IOs, we'd need to be doing roughly 20,000 IOPS (inodes are read/written in 8k cluster buffers, not 16k chunks). This is the effect of the bandwidth vs IOPS trade-off we use to speed the reading of inodes into memory.
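The arithmetic in Chinner's estimate can be reproduced with a short script. This is a sketch of the back-of-the-envelope calculation only: the inode size, chunk size, cluster buffer size and density figures come from the quote above, while the function names are hypothetical and decimal units (1 GB = 10^9 bytes) are an assumption.

```python
INODE_SIZE = 256                 # bytes per inode (default cited above)
INODES_PER_CHUNK = 64            # inodes per contiguous chunk
CHUNK_BYTES = INODE_SIZE * INODES_PER_CHUNK           # 16k per chunk
INODES_PER_CLUSTER = 8 * 1024 // INODE_SIZE           # 8k cluster buffer


def raw_read_rate(inode_rate, useful_fraction):
    """Bytes/s to read when only useful_fraction of each read is inodes."""
    return inode_rate * INODE_SIZE / useful_fraction


def inode_rate(raw_bytes_per_sec, useful_fraction):
    """Inodes/s delivered by a raw read rate at the given density."""
    return raw_bytes_per_sec * useful_fraction / INODE_SIZE


def iops_if_read_separately(rate):
    """IOs/s needed if every 8k cluster buffer is read with its own IO."""
    return rate / INODES_PER_CLUSTER


MB = 1e6  # assumption: decimal megabytes
print(f"chunk size:             {CHUNK_BYTES} bytes")
print(f"100% density, 640k/s:   {raw_read_rate(640_000, 1.0) / MB:.0f} MB/s")
print(f"50% density, 640k/s:    {raw_read_rate(640_000, 0.5) / MB:.0f} MB/s")
print(f"1 GB/s at 15% density:  {inode_rate(1e9, 0.15):,.0f} inodes/s")
print(f"separate IOs, 600k/s:   {iops_if_read_separately(600_000):,.0f} IOPS")
```

The results land close to the rounded figures in the quote: roughly 160MB/s at full density, roughly 320MB/s at 50 percent, a little under 600,000 inodes/s at 1GB/s when 85 percent of the data is discarded, and a little under 20,000 IOPS if each cluster buffer were read separately.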
When asked for some final comments, Dave said,
e2fsck doesn't optimize its IO for RAID arrays. Its performance comes from being able to do all its metadata IO sequentially because it is mostly in known places (inodes, free space, etc). XFS dynamically allocates all its metadata, so it needs to be more sophisticated to scale well.
Also, you might want to try the ag_stride option to xfs_repair to further increase parallelism if it isn't already IO or CPU bound. That can make it go quite a bit faster.
It is probably also worth checking to see if you have enough memory for xfs_repair to cache all its metadata in memory. The same metadata needs to be read in phase 3, 4 and 5, so if it can be cached in phase 3, then phase 4 and 5 run at CPU speed rather than IO speed, and that can significantly improve runtime ...
More information here from the talk I did all about this at LCA in 2008, specifically, slides 21 onwards show the breakdown of time spent in each phase as the number of inodes in the file system increases; slide 28 showing the effect of memory vs number of inodes; and slides 33 onwards showing the effect of ag_stride on performance on a 300M inode file system.
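For readers who want to experiment with the ag_stride suggestion, the following sketch assembles an xfs_repair command line without running it. The `-n` (no-modify) flag and the `-o ag_stride=` suboption are documented in the xfs_repair man page; the helper name, the device path and the stride value of 4 are placeholders for illustration, not recommendations from Chinner or from our tests.

```python
def build_xfs_repair_cmd(device, ag_stride=None, no_modify=True):
    """Return an argv list for xfs_repair; nothing is executed here."""
    cmd = ["xfs_repair"]
    if no_modify:
        # -n: check only, do not modify the file system
        cmd.append("-n")
    if ag_stride is not None:
        # -o ag_stride=N: process allocation groups in parallel with stride N
        cmd += ["-o", f"ag_stride={ag_stride}"]
    cmd.append(device)
    return cmd


# Hypothetical device path and stride value, for illustration only.
print(" ".join(build_xfs_repair_cmd("/dev/sdX1", ag_stride=4)))
```

A suitable stride depends on how the allocation groups map onto the underlying storage, so it is worth timing a few values on an unmounted test file system before relying on one.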
David also made a comment about our original suppositions that led to the article,
Repair scalability is not really an issue -- the problem is that finding the root cause of problems gets exponentially harder as file system size increases. So if you double your supported file size, expect to spend four times as much resources testing and supporting it. You can work out the business case from there ;)
Henry and I are both fans of Linux and want to see it succeed in every possible way, particularly in the HPC world where we both spend a great deal of time. We were disappointed to see that Red Hat supports Linux file systems only to 100TB. We knew that a number of key file system developers were working very hard to improve the scalability of Linux file systems. People such as Ric Wheeler, Dave Chinner, Eric Sandeen, Christoph Hellwig and Theodore Ts'o, just to name a few, were improving the scalability of the major Linux file systems. Based on these support limits, we decided to do some testing around metadata performance as measured by a file system check.
The tests we developed are designed to be repeatable without being too specific to a particular fragmentation or file system damage pattern. The times may be on the optimistic side since no damage had to be repaired, but they give you something of a best-case baseline for file system check times. Dave Chinner commented on this:
When there is damage, all bets are off. A file system that takes 15 minutes to check when there is no damage can take hours or days to repair when there is severe damage. Even minor damage can blow out repair times significantly. Not just the time it takes, but also the RAM required for repair to run to completion ...
Thus, trying to create "repeatable" file system repair tests is difficult at best.
The results indicated that the times to complete a file system check are within accepted norms. They also indicate that the metadata rates of xfs and ext4 are within what we call a "good range." The reason that they are "good" is that the file system checks can finish in less than a few hours, which is a very acceptable time for most admins.
We sincerely hope these are the first steps along the way toward better testing and development of Linux file systems. Developing tests that illustrate both the good and not so good aspects of file system behavior can only help the file systems get stronger. For example, the rather poor metadata performance of xfs drove Dave Chinner to focus on metadata development. We encourage vendors and the community to continue testing of the file systems, particularly larger scale testing since data never shrinks.
We hope to contribute to this testing as time allows, but if you are a vendor and have some hardware available for testing for a few weeks, we would love to collaborate or help in any way with testing.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.