Our examination of the ever-growing Linux file system scaling problem continues. In part 2 of our State of File Systems Technology series, Jeff Layton describes the approach and specs to be used in running the fsck wall clock time benchmark/test.
Building on Henry's Problem Statement, this article presents the test plan for performing fsck tests on Linux file systems. The goal is to test the fairly large file systems that might be encountered on large production systems to determine the state of file system check (fsck) performance. We ask for, and appreciate, your feedback on the test plan.
Testing storage systems, or any aspect of IT systems, is definitely not an easy task. Proper benchmarks take careful planning, testing and hardware. Even when we try to be careful, it is easy to forget, omit (either by design or by accident), or misconfigure something in the systems or the benchmarks, making the results less useful and perhaps unable to meet the original requirements. Henry and I often call these Slimy Benchmarking Tricks (SBTs). The end result is that good tests or benchmarks are difficult to do well. Perhaps as a consequence, much of the benchmarking we see today is of very poor quality, to the degree that it is virtually useless and, more often than not, merely entertaining (and sometimes frustrating).
Even if the benchmarks are done well, there is still the problem of correlating the benchmarks/tests to your application workload. This is true for computing-oriented benchmarks, such as taking something like H.264 encoding tests and determining how the benchmarks correlate to your weather modeling applications. This is also true for storage benchmarks. How do Postmark results correlate to an MPI-IO application that is doing astrophysics simulation? Or how do IOR results correlate to my database performance? The answer is as simple as it is nebulous--it depends.
There is no magic formula that tells you how to correlate benchmarks to real application workloads and, more specifically, to your application workload. The best predictor of your application workload's performance is, believe it or not, your application workload. However, it's not always possible to test your workload against storage solutions that range in terms of hardware, networking, file systems, file system tuning, clients, OS and so on. This is why we rely on benchmarks or tests as an indicator of how our workload might perform. Typically, this means you have to take these benchmarks, run them on your existing systems, and compare the trends to the trends of your application workloads.
For example, you could take your existing systems and run a variety of benchmarks/tests against them--IOR, IOzone, Postmark and so on--and run your workloads on the same systems. Then, you can compare the two sets of results and look for correlation. This might tell you which benchmark(s) track closest to your application, indicating which benchmark/test you should focus on when you look for data about new hardware or new file systems. But this task isn't easy and it takes time--time we usually don't have. However, to effectively use benchmarks and tests we need to understand this correlation and how it affects us. Otherwise it's just marketing information.
Keeping these ideas in mind, our goal is to examine the fsck (file system check) performance of Linux file systems by filling them with dummy data, and then executing an appropriate file system check. This article describes the approach and the details we will be using in running these tests. Fortunately, in our case, the benchmark/test is fairly simple--fsck wall clock time-- so this should make our lives, and yours, a bit easier.
The following sections go over the details of the testing. Please read them carefully, and we encourage your feedback on the test plan with suggestions/comments.
I've written elsewhere about benchmarking storage systems. We will try to adhere to the tenets presented in that article and be as transparent as possible. However, if you have any questions or comments, we encourage you to post them.
The basic plan for the testing has only three steps:

1. Build the file system on the test hardware.
2. Fill the file system with files using fs_mark.
3. Run the appropriate file system check and record the wall clock time.
That's pretty much it--not too complicated at this level, but the devil is always in the details.
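As a sketch, one XFS run of those steps might look like the following. The device and mount point are hypothetical placeholders, and the fs_mark arguments are covered later in the article:

```shell
# Hypothetical device and mount point for one XFS test run.
DEV=/dev/mapper/testvg-testlv
MNT=/mnt/home

run_xfs_test() {
    mkfs.xfs -f "$DEV"                  # step 1: build the file system
    mount "$DEV" "$MNT"
    fs_mark -n 10000000 -s 400 -L 1 -S 0 -D 10000 -N 1000 \
            -d "$MNT" -t 10 -k          # step 2: fill it with dummy files
    umount "$MNT"                       # the check must run on an unmounted file system
    time xfs_repair "$DEV"              # step 3: record the wall clock time
}
```

The ext4 runs differ only in the mkfs command and in using fsck.ext4 instead of xfs_repair.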
Since our goal is to understand the fsck time for current Linux file systems, we will run several tests to develop an understanding of how fsck performance scales with the number of files and the file system size. We'll test both XFS and ext4 at three file counts: 1) 100 million files, 2) 50 million files and 3) 10 million files.
According to the Red Hat documents Henry previously mentioned, XFS is supported up to 100TB. The testing hardware we have access to limits the total file system to about 80TB, formatted (more on that in the next section). We'll also test at 40TB (half that size). For testing ext4, the same Red Hat document says that only a 16TB file system is currently supported. To avoid running into any unforeseen problems, we'll test with a 10TB file system and a 5TB file system.
An fsck for ext4 at both of these file system sizes could take a long time to run, since there are very few spindles and a large number of files. Consequently, we will run these tests last.
Overall, the fsck tests will be run on the following combinations of file system size and file count (100 million, 50 million and 10 million files for each size):

80TB XFS File System
40TB XFS File System
10TB ext4 File System
5TB ext4 File System
For each of these combinations, three basic steps will be followed:

1. Build (mkfs) the file system.
2. Fill the file system using fs_mark.
3. Run the file system check and record the wall clock time.
For ext4, we will use fsck.ext4 to execute the file system check. For XFS, we will use xfs_repair. (Note: xfs_check only walks the file system and doesn't make repairs. We want to run the same command an admin would use, which is xfs_repair.)
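As a sketch, the two timed check commands look like the following; the helper functions and device arguments are hypothetical:

```shell
# Hypothetical helpers wrapping the check command for each file system.
check_ext4() { time fsck.ext4 -f "$1"; }   # -f forces a full check even if the fs is marked clean
check_xfs()  { time xfs_repair "$1"; }     # the same repair command an admin would run
```

The -f flag matters for ext4: e2fsck skips the full pass on a file system it considers clean, so without it the timing would not reflect a real check.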
One of the key pieces in the testing is how to fill the file system. The tool we will be using is called fs_mark. It was developed by Ric Wheeler (now at Red Hat) to test file systems. Fs_mark tests various aspects of file system performance, which is interesting but not the focus here. However, in running its tests, fs_mark conveniently fills the file system with files, which is exactly what we need.
Using fs_mark, the file system is filled and tested. There are a large number of options for fs_mark, but we will focus on only a few of them. An example command line for creating 100 million files is the following:
# fs_mark -n 10000000 -s 400 -L 1 -S 0 -D 10000 -N 1000 -d /mnt/home -t 10 -k
where the options are:

-n: the number of files each thread creates
-s: the size of each file
-L: the number of loop iterations
-S 0: skip the sync calls
-D: the number of subdirectories to spread the files across
-N: the number of files per subdirectory
-d: the directory in which to create the files
-t: the number of threads
-k: keep the files when the test finishes
With these options, there are 1,000 files per directory and 10,000 directories. Note that the "-n" option specifies only 10 million files because each thread creates "-n" files. Since we have 10 threads, each creating 10 million files, the result is a total of 100 million files.
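The file count arithmetic is easy to check in the shell:

```shell
threads=10                  # the -t value
files_per_thread=10000000   # the -n value
echo $((threads * files_per_thread))   # total files created: prints 100000000
```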
Since we have 100 million files and each file is 400KB, the file system holds a total of 40TB of data. This is about half of the 80TB of the largest file system. With the goal of filling at least 50 percent of the file system for each specified number of files, the resulting file sizes are listed below.
80TB XFS File System: 400KB files (100 million), 800KB files (50 million), 4MB files (10 million)
40TB XFS File System: 200KB files (100 million), 400KB files (50 million), 2MB files (10 million)
10TB ext4 File System: 50KB files (100 million), 100KB files (50 million), 500KB files (10 million)
5TB ext4 File System: 25KB files (100 million), 50KB files (50 million), 300KB files (10 million)
All of the tests fill 50 percent of the space except the last combination (the 5TB file system with 10 million files), which fills 60 percent.
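The per-combination file sizes follow from simple arithmetic: file size = capacity × fill fraction ÷ number of files. A small helper, using decimal units (1TB = 1,000,000,000KB), illustrates it:

```shell
# File size in KB for a given capacity (TB), fill percentage, and file count.
# Decimal units: 1TB = 1,000,000,000KB, so TB * pct/100 * 10^9 = TB * pct * 10^7 KB.
filesize_kb() {
    local tb=$1 pct=$2 nfiles=$3
    echo $(( tb * pct * 10000000 / nfiles ))
}

filesize_kb 80 50 100000000   # 80TB at 50 percent, 100 million files: prints 400
filesize_kb 5  60 10000000    # 5TB at 60 percent, 10 million files: prints 300
```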
When possible, we will repeat the tests several times so we can report the average and the standard deviation. In between tests, the file system will be remade and fs_mark will be rerun to refill it. Given the potentially large amount of time needed to fill the file system and run fsck, it is possible that only a few repetitions will be run.
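For the repeated runs, the average and standard deviation can be computed with a short awk pipeline. The timings below are made-up sample values for illustration, not results:

```shell
# Hypothetical fsck wall clock times in seconds from three repeated runs.
printf '%s\n' 3600 3720 3660 | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        sd = sqrt(sumsq / n - mean * mean)   # population standard deviation
        printf "mean=%.0f stddev=%.1f\n", mean, sd
    }'
# prints: mean=3660 stddev=49.0
```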
Dell has been kind enough to supply us with hardware for the testing. The hardware used is its NSS solution, which uses XFS. The configuration consists of a single NFS gateway with two quad-core Intel processors, either 24GB or 48GB of memory, and two 6Gbps RAID cards that are each connected to a daisy-chained series of JBODs. Each JBOD holds twelve 3.5" drives. The drives used are 7.2K rpm, 2TB NL-SAS drives. Each JBOD is configured as RAID-6, with 10 drives' worth of usable capacity and two drives' worth of parity, so each JBOD provides 20TB of capacity. The RAID cards use RAID-60 across their particular set of JBODs (RAID-6 within each JBOD and RAID-0 to combine them). LVM is then used to combine the capacity of the two RAID cards into a single device on which the file system is built.
For the 80TB configuration, a total of 48 drives is used (40 for capacity and 8 for parity). For the 40TB configuration, a total of 24 drives is used (20 for capacity and 4 for parity). The smaller configurations used in the ext4 testing simply use less of the 40TB configuration's capacity via LVM.
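The drive counts and capacities check out with a little arithmetic (2TB drives, two drives' worth of parity per JBOD):

```shell
per_drive_tb=2
jbods=2                     # 40TB configuration: two JBODs, one per RAID card
drives=$((jbods * 12))      # twelve drives per JBOD -> 24 drives
parity=$((jbods * 2))       # RAID-6: two drives' worth of parity per JBOD -> 4 drives
echo $(( (drives - parity) * per_drive_tb ))   # usable capacity in TB: prints 40
```

Doubling the JBOD count to four gives the 48-drive, 80TB configuration.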
Summary and Invitation for Comments
This article is really the test plan that executes the ideas embodied in Henry's article. The focus is on testing the fsck time for Linux file systems, particularly XFS and ext4, on current hardware. To gain an understanding of how fsck time varies, several file system sizes and file counts will be tested.
We will be using fs_mark to fill the file systems, then run the appropriate file system check and time how long it takes to complete. It's a pretty straightforward test that should give us some insight into how fsck performs on current Linux file systems.
We want to encourage feedback on the test plan. Is something critical missing? Is there perhaps a better way to fill the file system? Is there another important test point? Speak now or forever hold your peace.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.