Our Linux file system Fsck testing is finally complete. Just how bad is the Linux file system scaling problem?
After an extended delay, the Linux File System fsck testing results can now be presented. The test plan has changed slightly from our kickoff article previous article. We will review it at the beginning of the this article, followed by the actual results. Henry Newman will be reviewing the results and writing some observations in the next article in this series. As always we welcome reader feedback and comments.
It has been a while since we started the fsck project to test fsck (file system check) times on Linux file systems. The lengthy delay in obtaining the results is due to the lack of hardware for testing. The original vendor could not spare the hardware for testing. A number of other vendors were contacted and due to various reasons none of them could provide the needed hardware for many, many months if at all. In the end, Henry used his diplomatic skills to save the day, persuading Data Direct Networks to help us out. Paul Carl and Randy Kreiser from DDN contacted me and agreed to provide remote access to the hardware (thank you, DDN!).
Paul used a DDN SFA10K-X with 590 disks that are 450GB, 15,000 rpm SAS disks. He used a 128KB chunk size in the creation. From these disks he created a number of RAID-6 pools using an 8+2 configuration (8 data disks and 2 parity disks). Each pool is a LUN that is 3.6TB in size before formatting. The LUNS were presented to the server as disk devices such as /dev/sdb1, /dev/sdc1, /dev/sdd1, ..., /dev/sdx1 for a total of 23 LUNs of 3.6TBs each. This is a total of 82.8 TBs (raw). The LUNs were combined using mdadm and RAID-0 to create a RAID-60 configuration using the following command:
mdadm -- create /dev/md1 -- chunk=1024 -- level=0 -- raid-devices=23 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1 /dev/sdn1 /dev/sdo1 /dev/sdp1 /dev/sdq1 /dev/sdr1 /dev/sds1 /dev/sdt1 /dev/sdu1 /dev/sdv1 /dev/sdw1 /dev/sdx1
The result was a file system with about 72TB using "df -h" or 76,982,232,064 bytes from "cat /proc/partitions". A second set of tests were run on storage that used only 12 of the 23 LUNs. The mdadm command is,
mdadm -- create /dev/md1 -- chunk=1024 -- level=0 -- raid-devices=12 /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1 /dev/sdg1 /dev/sdh1 /dev/sdi1 /dev/sdj1 /dev/sdk1 /dev/sdl1 /dev/sdm1
The resulting file system for this configuration is about 38 TBs using "df -h".
The server used in the study is a dual-socket, Intel Xeon system with Nehalem processors (E5520) running at 2.27 GHz and an 8MB L3 processor cache. The server has a total of 24GB of memory, and it was connected to the storage via a Qlogic Fibre Channel FC8 card connected to an FC switch that was connected to the storage. The server ran CentOS 5.7 (2.6.18-274 kernel). The stock configuration was used throughput the testing except for one component. The e2fsprogs package was upgraded to version 1.42, enabling ext4 file systems larger than 16TB to be created. This allows the fcsk performance of xfs and ext4 to be contrasted.
Building the file systems was done close to the default behavior that many system admins will adopt -- using the defaults. The commands for building the file systems are:
Mounting the file systems involved a little more tuning. In the case of XFS, I used the tuning options as stated by Dell, XFS -- rw,noatime,attr2,nobarrier,inode64,noquota. In the case of ext4, the mounting options used are defaults,data=writeback,noatime,barrier=0,journal_checksum.
The journal checksum was turned on within ext4 since I like this added behavior.
One of the keys to the testing is how the file system is filled. This can be a very time consuming process because you must create all of the files in some sort of order or fashion. For this testing, fs_mark was used. Ric Wheeler at Red Hat has been using it for testing file systems at very large scales (over 1 billion files). Fs_mark wasn't used for testing the file system in this article, but rather, it is used to fill the file system in a specific fashion. It uses one or more base directories and then creates a specified number of subdirectories underneath them that are filled with files. You might think of this as a single-level of subdirectories. It is much more complicated to create specific subdirectory depths and number of files since that configuration depends on the specific users and situation. You could also use some sort of random approach with the hope that a random distribution approximates a real-world situation. It is virtually impossible to have a representative file system tree that fits most general situations, and the single-level deep directory tree used here should represent one extreme of file systems -- a single subdirectory level.
One of the nice features of fs_mark is that it is threaded so that each thread produces its own unique directory structure with a single layer of subdirectories underneath a base directory that contains a fixed number of files. Fs_mark also allows you to specify the number of files per thread so that you can control the total number of files. Although the server has eight total cores, running eight threads (one per core) it resulted in the OS swapping. When the number of threads is reduced to three, the server did not swap, and the file creation rate was much faster than running eight threads with swapping.
Using three threads causes some issues because it is an odd number. This made it impossible to determine an integer number of files per thread, as using the old file counts was not possible. The number of files per thread was changed to a reasonable integer number that is close to the original numbers of 100,000,000, 50,000,000, and 10,000,000. The numbers chosen were: 105,000,000, 51,000,000, and 10,200,000.
The goal for all fs_mark commands was to fill the file system to the specified number of files while filling about 50 percent of the file system. The following fs_mark command lines were used to fill the file system for 72TB:
The commands for filling the 38TB file systems were:
Notice that the number of files per directory is a constant (-N 1000 or 1,000 files).
After the file system was filled using fs_mark, it was unmounted, and the file system check was run on the device. In the case of xfs, the command is,
/sbin/xfs_repair -v /dev/md1
For ext4, the file system check was,
/sbin/e2fsck -pfFt /dev/md1
Notice that the device /dev/md1 was the target in both cases.
DDN was kind of enough to offer additional testing time so I decided to try some tests that stretched the boundaries a bit. The first test was to create an XFS file system with 415,000,000 files and filling about 40 percent of the file system on the 72TB file system. The second test was to try to increase the fragmentation of the file system by randomly adding and deleting directories using fs_mark for the 105,000,000 file case also on the 72TB file system.
For the first test where 415,000,000 files were created, the original goal was to test 520,000,000 files in five stages of 105,000,000 files (creating 520,000,000 all at once caused the server to swap badly). However, due to time constraints, only four of the five stages could be run (fs_mark ran increasingly slower the more files were on the system). The final number of files created was 420,035,002 which also includes all "." and ".." files on the directories.
For the second test, approximately 105,000,000 files were created on an XFS file system in several steps. A total of five stages were used where 21,000,000 files were added at each stage using fs_mark (a total of 105,000,000 files). In between the stages, a number of directories were randomly removed, and the same number of directories anf files were replaced using fs_mark on randomly selected directories. The basic process is listed below:
Because of the random nature of selecting the directories, it is possible to get some directories with many more files than others. However, the total number of files won't be 105,000,000 since of the random nature of selection for deletion and insertion. If we count all fo the files including the "." and ".." files we find that the process created 115,516,127 files.
The table below lists the file system repair times in seconds for the standard matrix cases as specified in the previous article but with the new number of files. These times include all steps in the file system checking process.
|File System Size (in TB)||Number of Files (in Millions)||XFS - xfs_repair time (Seconds)||ext4 - fsck time (Seconds)|
The FSCK time for the additional tests are listed below:
Notice that the 415,000,000 case took 6.95 times longer than the 105,000,000 file case even though it had four times as many files. During the file system check the server did not swap, and no additional use of virtual memory was observed.
The "fragmented" case is interesting because it took less time to perform the file system check than the one-level directory case. The original case took 1,629 seconds and the fragmented case took only 676 seconds -- about 2.5 times faster. Time did not allow investigating why this happened.
In the next article in this seriesHenry writing about his observations of the results. Please be sure to post your comments about these testing results.
At first glance it seemed simple a vendor could provide about 80TB to 100TB of raw storage connected to a server for testing, but this turned out not to be the case. It was far more difficult than anticipated. I would be remiss if I didn't thank the people who made this possible: Of course Henry Newman for pushing various vendors to help if they could. Thanks go to Paul Carl and Randy Kreiser from DDN who greatly helped in giving me access to the hardware and helped with the initial hurdles that crop up. Thanks also to Ric Wheeler who answered several emails about using fs_mark and about Linux file systems in general. He has been a big supporter of this testing from the beginning. Thanks also to Andres Dilger from Whamcloud who provided great feedback and offers of help all of the time.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.