The results of our Linux file system fsck testing are in and posted, but the big question remains: What do the results tell us, what do they mean, and is the performance expected? In this article we will take a look at the results, talk to some experts, and sift through the tea leaves for their significance.
The Linux file system fsck test results article generated some comments and discussions that are addressed in this article. However, before we do so, let's review the reason for the testing and what we hoped to learn from it.
Almost a year ago, Henry Newman and I had a wonderful Cuban dinner and started talking about file systems and storage technology, particularly in Linux. We both want to see the Linux community succeed and thrive, but some of the signs of that happening were not very encouraging at the time. The officially supported file system limits from Red Hat were fairly small, with 100TB being the largest file system supported. We also talked about some of the possible issues and thought that one possible reason for the limitation was metadata scaling issues, particularly the amount of time needed to complete a file system check (fsck).
Henry and I speculated that one possible reason for Red Hat imposing supported file system size limitations was the amount of time needed to perform an fsck. (Note: These are supported file system limitations, not theoretical capacities.) Consequently, we decided to do some testing on larger file systems, 50TB and 100TB, which are fairly large capacities given the supported limits, populated with a large number of files. Our initial goal was up to 1 billion files.
The original fsck test plan was to test both ext4 and xfs with a varying number of files, from 10 million to 100 million, and two capacities for each file system: 40TB and 80TB for xfs, and 5TB and 10TB for ext4. The goal for ext4 was to stay below 16TB, since that was its limit when the first article was written.
The original source of hardware for testing could not give us access to test hardware due to various reasons, and it was many months before Data Direct Networks (DDN) provided extended access for testing (thanks very much, DDN!). The details of the fsck testing and the results are explained in great detail in the previous article. Some of the details of the testing were changed from the original plan due to changes in the hardware and changes in the software.
Just to reiterate, our goal was to test the fsck rate of the file systems. We were very curious about how quickly we could perform an fsck. Consequently, we filled a file system using fs_mark with a specified number of directories (only one layer deep), a specified number of files, and a specified file size. This tool has been used in other fsck studies (see subsequent section). Then the file systems were unmounted, and the respective fsck was run. Since there was no damage to the file system, it was expected that the time to complete a file system check would be as short as possible (i.e., the fastest possible metadata rates).
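The timing step of this procedure can be sketched in a few lines of Python. This is only an illustration, not the harness used for the tests: the helper name `time_command` is hypothetical, and the actual check command (for example, `xfs_repair` against an unmounted device) has to be supplied by the caller.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run a command and return (exit_code, elapsed_seconds).

    Hypothetical helper: the caller supplies the full fsck command line
    for a file system that has already been unmounted.
    """
    start = time.monotonic()
    completed = subprocess.run(argv)
    elapsed = time.monotonic() - start
    return completed.returncode, elapsed


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real check command for the file system under test.
    code, seconds = time_command([sys.executable, "-c", "pass"])
    print(f"exit code {code}, {seconds:.2f} s")
```

Dividing the number of files created by fs_mark by the elapsed time gives the files-per-second rate discussed in the analysis.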
The reason we chose to perform the fsck testing in this manner was that artificially damaging or fragmenting the file system in some manner is arbitrary. That is, the results would apply only to the specific details of the damage or fragmentation process. This would tell us very little about the metadata performance in an fsck context of these file systems for other cases. If you will, the testing would tell us the best possible fsck time (shortest time).
Since interpreting these results had much to do with the inner workings of some of the fsck tools, I reached out to David Chinner, one of the lead developers for xfs. He also happens to be employed at Red Hat and is an all-round file system kernel guy. He seemed the ideal person to contact for help in interpreting the results.
Analysis of FSCK Results
The results presented in the previous article were just the raw results of how much time it took to complete the file system check. I will be examining the results in the previous article except for the case labeled "fragmented" because the results for this case looked strangely out of line with the rest of the results (an outlier). After discussing it with David Chinner, I decided to drop that case from further analysis.
One obvious question this data raises is, how many files per second did the file system check process? Table 1 reproduces the data from the fsck test results article; beneath each fsck time, the number of files touched per second during the fsck is shown in red. Recall that the testing used CentOS 5.7 and a 2.6.18-274.el5 kernel.
Table 1: FSCK times for the various file system sizes, number of files, and for xfs and ext4 file systems. The fsck rate is shown below the fsck times.
|File System |Number of Files (millions of files) |XFS - xfs_repair |ext4 - fsck time |
The fastest fsck rate is for the case with 51 million files and a 38TB xfs file system (191,729.3 files/s). The slowest rate is for the case of 10.2 million files and a 72TB ext4 file system (10,493.8 files/s).
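The rate calculation itself is simple division, and a short Python sketch makes it explicit. The function name `fsck_rate` is hypothetical, and the fsck times of 266 and 972 seconds are not quoted in this article: they are back-computed from the two rates above and rounded to whole seconds, so treat them as an assumption.

```python
def fsck_rate(num_files, fsck_seconds):
    """Files touched per second during a file system check."""
    if fsck_seconds <= 0:
        raise ValueError("fsck time must be positive")
    return num_files / fsck_seconds


# Times back-computed from the rates quoted in the text (an assumption,
# rounded to whole seconds).
fastest = fsck_rate(51_000_000, 266)  # 38TB xfs, 51 million files
slowest = fsck_rate(10_200_000, 972)  # 72TB ext4, 10.2 million files

print(f"{fastest:,.1f} files/s")  # 191,729.3 files/s
print(f"{slowest:,.1f} files/s")  # 10,493.8 files/s
```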
In looking at the data, I have made some general observations about the results.
- For these tests you can easily see an order of magnitude difference in the rate of files processed during the file system check.
- The fsck for ext4 is slower than for xfs.
- In general, for this small number of tests, the rate of files processed during the fsck for ext4 improved as the number of files increased. For xfs, the trend is not as consistent, but overall, as the number of files increased, the rate of files processed during the fsck for xfs generally improved.
- All of the file system checks finished in less than four hours (an unwritten goal of the original study).
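Two of these observations can be checked with quick arithmetic. The sketch below uses only the fastest and slowest rates quoted earlier; the helper `within_budget` and the 972-second figure (back-computed from the slowest rate) are illustrative assumptions rather than values taken from the test logs.

```python
import math

FASTEST_RATE = 191_729.3  # files/s, 38TB xfs, 51 million files
SLOWEST_RATE = 10_493.8   # files/s, 72TB ext4, 10.2 million files

# Spread between the fastest and slowest cases.
ratio = FASTEST_RATE / SLOWEST_RATE
print(f"fastest/slowest ratio: {ratio:.1f}x")
print(f"orders of magnitude: {math.log10(ratio):.2f}")


def within_budget(fsck_seconds, budget_hours=4):
    """True if a check finished inside the (unwritten) time budget."""
    return fsck_seconds < budget_hours * 3600


# 972 s is back-computed from the slowest quoted rate (an assumption).
print(within_budget(972))
```

A ratio above 10 corresponds to a base-10 logarithm above 1, which is what "an order of magnitude" means here.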
During most of the fsck tests, the server did not swap. This was checked at various times during the fsck using "vmstat." However, for the case of 415 million files on the 72TB xfs file system, it appears that the server did swap at some point between those spot checks, which therefore missed it. The evidence is the large drop in fsck rate compared to the other cases.
Dave Chinner suggested another way to check for a trend in the data: looking at the file rate for the same inode count but different file system sizes for xfs. This data is shown in Table 2.
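Spot checks can miss a short burst of swapping. One alternative, which is not what was done in these tests, is to compare the kernel's cumulative swap counters before and after the run. The Python sketch below assumes the kernel exposes `pswpin` and `pswpout` in `/proc/vmstat`; the function names are hypothetical.

```python
def read_swap_counters(vmstat_text):
    """Extract pswpin/pswpout (pages swapped in/out since boot)."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters


def swapped_between(before_text, after_text):
    """True if either swap counter grew between the two snapshots."""
    before = read_swap_counters(before_text)
    after = read_swap_counters(after_text)
    return any(
        after.get(key, 0) > before.get(key, 0)
        for key in ("pswpin", "pswpout")
    )


# Made-up snapshots in /proc/vmstat format, for illustration only.
before = "nr_free_pages 1000\npswpin 10\npswpout 20\n"
after = "nr_free_pages 900\npswpin 10\npswpout 25\n"
print(swapped_between(before, after))  # True
```

In practice the two snapshots would come from reading `/proc/vmstat` immediately before and after the fsck, so any swapping during the run shows up regardless of when it happened.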
Table 2: Difference in File Processing Rate for XFS for the Two File System Sizes.
|File System Size|
Dave's comments about the results are that as the file system size was decreased by roughly 50 percent (from 72TB to about 38TB), the number of files processed is decreased by about 50 percent for the larger numbers of files. The larger file system has 50 percent more allocation groups (AGs), which results in inodes being spread over a 50 percent larger physical area. This larger physical area containing inodes means that the average seek time to read the inodes increases. Hence, the processing rate goes down due to the longer IO latencies. That means the overall change in file rate isn't surprising. In David's words, "Large file systems mean more locations that spread across, which means more seeks to read them all ..."
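The comparison behind this kind of table is a percent change in file rate between the two file system sizes at the same file count. A minimal Python sketch follows; the function name is hypothetical, and the numbers in the usage lines are made up purely to show the sign convention, not measured values.

```python
def rate_change_percent(rate_large_fs, rate_small_fs):
    """Percent change in files/s going from the large to the small file system.

    Positive means the smaller file system was checked faster;
    negative means it was checked more slowly.
    """
    if rate_large_fs <= 0:
        raise ValueError("rate must be positive")
    return (rate_small_fs - rate_large_fs) / rate_large_fs * 100.0


# Made-up rates, only to illustrate the sign convention.
print(rate_change_percent(100.0, 150.0))  # 50.0
print(rate_change_percent(200.0, 100.0))  # -50.0
```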
The same data for ext4 is in Table 3 below:
Table 3: Difference in File Processing Rate for ext4 for the Two File System Sizes.
|File System Size|
Notice that the file rates for the 72TB and 38TB file system sizes are roughly the same. That is because ext4 preallocates the inode space in known areas and should use those areas in exactly the same pattern on both file systems. The higher regions are not used at all in any of the configurations because we used only 50 percent of the file system capacity and there was no fragmentation.