Fixing SSD Performance Degradation, Part 2 Page 3
For IOzone the system specifications are fairly important since they affect the command line options. In particular, the amount of system memory is important because it can have a large impact on caching effects. If the problem sizes are small enough to fit, even partially, into the system or file system cache, the results can be skewed. Comparing results from a system where cache effects are prominent with results from a system where they are not is comparing the proverbial apples to oranges. For example, if you run the same problem size on a system with 1GB of memory versus a system with 8GB, you will get very different results.
For this article, cache effects will be limited as much as possible. Cache effects can't be eliminated entirely without running extremely large problems and forcing the OS to virtually eliminate all caches. But, one of the best ways to minimize the cache effects is to make the file size much bigger than the main memory. For this article, the file size is chosen to be 16GB which is twice the size of main memory. This is chosen arbitrarily based on experience and some urban legends floating around the Internet.
For this article, the total file size was fixed at 16GB and four record sizes were tested: (1) 1MB, (2) 4MB, (3) 8MB, and (4) 16MB. For a file size of 16GB that is roughly (1) 16,000 records, (2) 4,000 records, (3) 2,000 records, and (4) 1,000 records. Smaller record sizes took too long to run since the number of records would be very large, so they are not used in this article.
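The record counts quoted above are round figures. As a quick check, here is a minimal shell sketch of the arithmetic, assuming that the G and M size suffixes denote binary units (GiB and MiB). The function name records_per_file is purely illustrative and is not part of IOzone.

```shell
#!/bin/sh
# Number of records in a file = file size / record size.
# Assumption: binary units, so 1 GiB = 1024 MiB.
records_per_file() {
    file_gib="$1"   # file size in GiB
    rec_mib="$2"    # record size in MiB
    echo $(( file_gib * 1024 / rec_mib ))
}

# The four record sizes used for the throughput tests.
for rec in 1 4 8 16; do
    echo "${rec}MB records: $(records_per_file 16 "$rec")"
done
```

Under that assumption the exact counts are 16,384, 4,096, 2,048, and 1,024, slightly higher than the rounded numbers in the text.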
The command line for the first record size (1MB) is,
./IOzone -Rb spreadsheet_output_1M.wks -s 16G -r 1M > output_1M.txt
The command line for the second record size (4MB) is,
./IOzone -Rb spreadsheet_output_4M.wks -s 16G -r 4M > output_4M.txt
The command line for the third record size (8MB) is,
./IOzone -Rb spreadsheet_output_8M.wks -s 16G -r 8M > output_8M.txt
The command line for the fourth record size (16MB) is,
./IOzone -Rb spreadsheet_output_16M.wks -s 16G -r 16M > output_16M.txt
IOPS Using IOzone
For measuring IOPS performance I'm also going to use IOzone. While IOzone is more commonly used for measuring throughput performance, it can also measure operations per second (IOPS - I/O Operations Per Second) with a simple command line option. More specifically, it can be used to measure sequential read and write IOPS as well as random read and random write IOPS.
For this article, IOzone was used to run four specific IOPS tests. These tests are:
- Sequential Write
- Sequential Read
- Random Read
- Random Write
As with the throughput tests, the IOPS tests used a file size that is twice the size of memory. The goal is to push the file size out of what could be cached by Linux.
For this article a total file size of 16GB was used. Within this 16GB file, four record sizes were tested: (1) 4KB, (2) 8KB, (3) 32KB, and (4) 64KB. These sizes were chosen because the run times for smaller record sizes were much longer, and following good benchmarking practice by running each test 10 times would have resulted in very long benchmark times (weeks). In addition, 4KB is the typical record size used in IOPS testing.
You might laugh at the larger record sizes, but there are likely to be applications that depend upon how quickly they can read/write 64KB records (I quit saying "never" with respect to application I/O - I've seen some truly bizarre patterns, so "never" has been removed from my vocabulary).
The command line for the first record size (4KB) is,
./iozone -Rb spreadsheet_output_4K.wks -O -i 0 -i 1 -i 2 -e -+n -r 4K -s 16G > output_4K.txt
The command line for the second record size (8KB) is,
./iozone -Rb spreadsheet_output_8K.wks -O -i 0 -i 1 -i 2 -e -+n -r 8K -s 16G > output_8K.txt
The command line for the third record size (32KB) is,
./iozone -Rb spreadsheet_output_32K.wks -O -i 0 -i 1 -i 2 -e -+n -r 32K -s 16G > output_32K.txt
The command line for the fourth record size (64KB) is,
./iozone -Rb spreadsheet_output_64K.wks -O -i 0 -i 1 -i 2 -e -+n -r 64K -s 16G > output_64K.txt
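The four IOPS command lines differ only in the record size, so they can be generated from one template. The sketch below is a dry run that prints the commands rather than executing them. The function name iops_command is hypothetical, and the flag descriptions in the comments paraphrase the IOzone documentation rather than anything stated in this article.

```shell
#!/bin/sh
# Print the IOzone IOPS command line for one record size (for example 4K).
# Flags, paraphrased from the IOzone documentation:
#   -Rb FILE  generate an Excel-compatible report and write it to FILE
#   -O        report results in operations per second
#   -i 0/1/2  run the write, read, and random read/write tests
#   -e        include flush (fsync, fflush) time in the timings
#   -+n       skip the retests (rewrite and re-read)
#   -r / -s   record size / file size
iops_command() {
    rec="$1"
    echo "./iozone -Rb spreadsheet_output_${rec}.wks -O -i 0 -i 1 -i 2 -e -+n -r ${rec} -s 16G > output_${rec}.txt"
}

# Dry run: print the four commands instead of executing them.
for rec in 4K 8K 32K 64K; do
    iops_command "$rec"
done
```

Printing first makes it easy to review the commands before committing to a benchmark that takes many hours; piping the output to sh would then run them in order.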
A common benchmark used for HPC storage systems is called metarates. Metarates was developed by the University Corporation for Atmospheric Research (UCAR) and is an MPI application that tests metadata performance by using POSIX system calls:
- creat() - open and possibly create a file
- stat() - get file status
- unlink() - delete a name and possibly the file it refers to
- fsync() - synchronize a file's in-core state with storage device
- close() - close a file descriptor
- utime() - change file last access and modification times
Using these system calls, the main analysis options for metarates are the following:
- Measure the rate of file creates/closes (file creates/closes per second)
- Measure the rate of utime calls (utime operations per second)
- Measure the rate of stat calls (stat operations per second)
Metarates has options for the number of files to write per MPI process (remember that you will have N processes with an MPI application, where N is a minimum of 1) and for whether the files are written to a single directory or to many directories. It also has the option of using the system call fsync() to synchronize the file's in-core state with the storage device.
Remember that Metarates is an MPI application allowing us to choose the number of processes (cores) we use during the run. So for this benchmark and this test system, 1, 2, and 4 cores were used (three independent tests). These tests are labeled as NP=1 (1 core), NP=2 (2 cores), NP=4 (4 cores) where NP stands for Number of Processes.
Not forgetting our good benchmarking skills, the run time (wall clock time) of the runs should be greater than 60 seconds if possible. So the number of files was varied for 4 MPI processes until a run time of 60 seconds was reached. The resulting number of files was found to be 1,000,000 and was fixed for all tests. Also, it was arbitrarily decided to have all files written to the same directory with the goal of really stressing the metadata performance and, hopefully, the SSD.
The final command line used for metarates for all three numbers of processors (1, 2, and 4) is the following.
time mpirun -machinefile ./machinefile -np 4 ./metarates -d junk -n 1000000 -C -U -S -u >> metarates_disk.np_4.1.out
where the "-np" option stands for number of processes (in this case 4), "-machinefile" refers to the list of hostnames of systems to be used in the run (in this case it is a file named "./machinefile" that contains the test machine hostname repeated 4 times - once for each process), and the results sent to stdout are appended to the file "metarates_disk.np_4.1.out", which is an example of how the output files were named.
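To make the three process counts concrete, here is a minimal dry-run sketch that prints a machinefile body and the metarates command line for NP=1, 2, and 4. Two things are assumptions on my part: that the trailing ".1" in the output file name is a run number, and the helper names machinefile_lines and metarates_command, which are purely illustrative. The host name test64 is the one that appears in the IOR output later in this article.

```shell
#!/bin/sh
# Print a machinefile body: the host name repeated once per process.
machinefile_lines() {
    host="$1"
    count="$2"
    i=0
    while [ "$i" -lt "$count" ]; do
        echo "$host"
        i=$(( i + 1 ))
    done
}

# Print the metarates command line for a process count and a run number.
# Assumption: the ".1" in the example output file name is the run number.
metarates_command() {
    np="$1"
    run="$2"
    echo "time mpirun -machinefile ./machinefile -np ${np} ./metarates -d junk -n 1000000 -C -U -S -u >> metarates_disk.np_${np}.${run}.out"
}

# Dry run: show the machinefile contents, then the three commands.
machinefile_lines test64 4
for np in 1 2 4; do
    metarates_command "$np" 1
done
```

Redirecting the output of machinefile_lines to ./machinefile would create the file that mpirun reads.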
Notice that three different performance measures are used:
- File create and close rate (how many per second)
- File stat rate (how many "stat" operations per second)
- File utime rate (how many "utime" operations per second)
As mentioned earlier in the article, the basic testing process runs the benchmarks on a "clean", brand-new drive, then runs some heavy I/O tests on the drive, and then immediately reruns the same set of benchmarks for comparison with the first set. The benchmarks have already been discussed, but the "torture" tests still need to be covered.
The goal of the I/O intensive tests is to exercise the underlying storage media but this also means that the file system needs to be exercised. In particular, we want to stress the Intel SSD and then retest it to see how the various performance technologies help improve performance. So these I/O intensive tests should run both smaller and larger files as well as various record sizes. They should also stress the storage performance as much as possible to put the SSD controller under as much pressure as possible (this can help put block allocation techniques under pressure). The application chosen is IOR.
IOR is an MPI-based I/O benchmark code designed to test both N-N performance (N clients reading/writing to N files) and N-1 performance (N clients all reading/writing to a single file). IOR has many, many options depending upon what you want to test, but the basic approach is to break up the file into what are called segments. The segments are in turn broken into blocks. The data for each block is transferred in "t" size units (t = transfer size). Figure 1 below, from a presentation by Hongzhang Shan and John Shalf from NERSC, shows how a file is constructed from these parameters.
Figure 1 - IOR File Layout
In this simple illustration, the segment size and the block size are the same (i.e. one block per segment).
Two IOR runs were made and each of these was repeated 10 times. The first IOR command line is:
mpirun -np 4 -machinefile ./machinefile ./IOR -a POSIX -b 64k -F -i 10 -s 200000 -t 4k -vv
The first part of the command, "mpirun -np 4 -machinefile ./machinefile" is all MPI command options:
- -np 4: This means that we are using 4 processes for this run (remember that IOR is an MPI code) which corresponds to 4 cores in the system.
- -machinefile ./machinefile: This tells MPI the location of the list of hostnames to use during the run. Since this is a single system, the file just lists the system hostname four times.
- ./IOR: This is the name of the executable
The IOR run options come after the executable IOR. These options are explained below:
- -a POSIX: This tells IOR to use the POSIX API (not MPI-IO or other APIs)
- -b 64k: This option is the block size which in this case is 64KB.
- -F: This tells IOR to use 1 file per process. For this example since we have 4 processes, we will get 4 files (this is what is referred to as N-N I/O or N processes creating a total of N files).
- -i 10: This option tells IOR to run the test 10 times. This is the number of repetitions IOR itself will run during the test. However, I will still run the IOR command 10 times.
- -s 200000: This tells IOR the number of segments to use. In this case it is 200,000.
- -t 4k: This tells IOR the transfer size which in this case is 4KB.
- -vv: This option tells IOR to be fairly verbose with its output.
IOR will run both a read and a write test with the previous options presented. You can calculate the size of the files based on block size, the number of blocks per segment, and the number of segments. However, it is easier just to show you the output from a single run of IOR with the specific options:
*** IOR test runs: Date stated ***
Tue Sep 28 09:13:35 EDT 2010
*** Run 1 ***
Tue Sep 28 09:13:35 EDT 2010
IOR-2.10.2: MPI Coordinated Test of Parallel I/O

Run began: Tue Sep 28 09:13:37 2010
Command line used: ./IOR -a POSIX -b 64k -F -i 10 -s 200000 -t 4k -vv
Machine: Linux test64 2.6.30 #5 SMP Sat Jun 12 13:02:20 EDT 2010 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.05 sec
Path: /mnt/home1/laytonjb
FS: 58.7 GiB Used FS: 0.2% Inodes: 3.7 Mi Used Inodes: 0.0%
Participating tasks: 4
task 0 on test64
task 1 on test64
task 2 on test64
task 3 on test64

Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    pattern            = strided (200000 segments)
    ordering in a file = sequential offsets
    ordering inter file= no tasks offsets
    clients            = 4 (4 per node)
    repetitions        = 10
    xfersize           = 4096 bytes
    blocksize          = 65536 bytes
    aggregate filesize = 48.83 GiB

Using Time Stamp 1285679617 (0x4ca1ea01) for Data Signature
Commencing write performance test.
Tue Sep 28 09:13:37 2010

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----------  ---------  --------  --------  --------  --------  ----
write     198.23     64.00       4.00       0.049995  252.24    2.23      252.24    0     XXCEL
[RANK 000] open for reading file testFile.00000000 XXCEL
Commencing read performance test.
Tue Sep 28 09:17:49 2010

read      241.51     64.00       4.00       0.046069  207.03    3.40      207.03    0     XXCEL
Using Time Stamp 1285680083 (0x4ca1ebd3) for Data Signature
Commencing write performance test.
Tue Sep 28 09:21:23 2010
...
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
write 198.43 197.84 198.19 0.17 0.25 0.25 0.25 0.00 252.28032 4 4 10 1 0 1 0 0 200000 65536 4096 52428800000 -1 POSIX EXCEL
read 242.43 240.65 241.78 0.47 0.31 0.31 0.31 0.00 206.79841 4 4 10 1 0 1 0 0 200000 65536 4096 52428800000 -1 POSIX EXCEL

Max Write: 198.43 MiB/sec (208.07 MB/sec)
Max Read: 242.43 MiB/sec (254.20 MB/sec)

Run finished: Tue Sep 28 10:31:01 2010
From the output you can see that the aggregate file size is 48.83 GiB (remember that the file size has to stay under 64GB, the size of the drive, and under 58.7 GiB, the size of the formatted file system). In some early testing, this IOR command took about 7 minutes to run on the test system; the 10-repetition run shown above took about 77 minutes (09:13:37 to 10:31:01).
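The aggregate file size can also be worked out by hand. A minimal sketch, assuming one block per segment per task as in this file-per-process run, multiplies block size, segment count, and task count. The function names are illustrative only, and the calculation relies on the shell supporting 64-bit integer arithmetic.

```shell
#!/bin/sh
# Aggregate IOR file size in bytes: block size x segment count x task count.
# Assumption: one block per segment per task (as in the run above).
ior_aggregate_bytes() {
    block_bytes="$1"
    segments="$2"
    tasks="$3"
    echo $(( block_bytes * segments * tasks ))
}

# Convert a byte count to GiB with two decimal places.
bytes_to_gib() {
    awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 * 1024 * 1024) }'
}

# First IOR run: -b 64k (65536 bytes), -s 200000, 4 tasks.
bytes=$(ior_aggregate_bytes 65536 200000 4)
echo "$bytes bytes = $(bytes_to_gib "$bytes") GiB"
```

The result, 52428800000 bytes or 48.83 GiB, matches the "aggsize" and "aggregate filesize" values that IOR reports.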
The second IOR command line is very similar to the first but with different block size, transfer size, and number of segments. The command line is
mpirun -np 4 -machinefile ./machinefile ./IOR -a POSIX -b 1M -F -i 10 -s 14000 -t 256k -vv
The output from this IOR run is,
IOR-2.10.2: MPI Coordinated Test of Parallel I/O

Run began: Tue Sep 28 10:31:03 2010
Command line used: ./IOR -a POSIX -b 1M -F -i 10 -s 14000 -t 256k -vv
Machine: Linux test64 2.6.30 #5 SMP Sat Jun 12 13:02:20 EDT 2010 x86_64
Using synchronized MPI timer
Start time skew across all tasks: 0.00 sec
Path: /mnt/home1/laytonjb
FS: 58.7 GiB Used FS: 0.2% Inodes: 3.7 Mi Used Inodes: 0.0%
Participating tasks: 4
task 0 on test64
task 1 on test64
task 2 on test64
task 3 on test64

Summary:
    api                = POSIX
    test filename      = testFile
    access             = file-per-process
    pattern            = strided (14000 segments)
    ordering in a file = sequential offsets
    ordering inter file= no tasks offsets
    clients            = 4 (4 per node)
    repetitions        = 10
    xfersize           = 262144 bytes
    blocksize          = 1 MiB
    aggregate filesize = 54.69 GiB

Using Time Stamp 1285684263 (0x4ca1fc27) for Data Signature
Commencing write performance test.
Tue Sep 28 10:31:03 2010

access    bw(MiB/s)  block(KiB)  xfer(KiB)  open(s)   wr/rd(s)  close(s)  total(s)  iter
------    ---------  ----------  ---------  --------  --------  --------  --------  ----
write     197.90     1024.00     256.00     0.000222  282.96    1.02      282.96    0     XXCEL
[RANK 000] open for reading file testFile.00000000 XXCEL
Commencing read performance test.
Tue Sep 28 10:35:46 2010

read      241.42     1024.00     256.00     0.000369  231.97    6.03      231.97    0     XXCEL
Using Time Stamp 1285684785 (0x4ca1fe31) for Data Signature
Commencing write performance test.
Tue Sep 28 10:39:45 2010
...
Operation Max (MiB) Min (MiB) Mean (MiB) Std Dev Max (OPs) Min (OPs) Mean (OPs) Std Dev Mean (s)Op grep #Tasks tPN reps fPP reord reordoff reordrand seed segcnt blksiz xsize aggsize
--------- --------- --------- ---------- ------- --------- --------- ---------- ------- -------
write 197.95 197.68 197.79 0.09 0.06 0.06 0.06 0.00 283.12821 4 4 10 1 0 1 0 0 14000 1048576 262144 58720256000 -1 POSIX EXCEL
read 242.10 241.06 241.56 0.31 0.07 0.07 0.07 0.00 231.82731 4 4 10 1 0 1 0 0 14000 1048576 262144 58720256000 -1 POSIX EXCEL

Max Write: 197.95 MiB/sec (207.56 MB/sec)
Max Read: 242.10 MiB/sec (253.86 MB/sec)

Run finished: Tue Sep 28 11:57:59 2010
This IOR test uses a slightly larger file size of 54.69 GiB and takes about 90 minutes to finish.
These two IOR command lines were run 10 times each as the I/O intensive benchmark for the SSD. Notice that the file sizes are quite large to put maximum pressure on the drive, and the block sizes are both small and large to put even more pressure on the SSD and its controller.
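Put together, the I/O intensive workload amounts to 20 IOR invocations. The sketch below prints them as a dry run. It assumes the two command lines alternate within each of the 10 runs, which is what the timestamps in the output suggest (the second run began two seconds after the first finished) but is not stated explicitly. The function names are hypothetical.

```shell
#!/bin/sh
# Print one IOR command line for a block size, segment count, and transfer size.
ior_command() {
    block="$1"
    segments="$2"
    xfer="$3"
    echo "mpirun -np 4 -machinefile ./machinefile ./IOR -a POSIX -b ${block} -F -i 10 -s ${segments} -t ${xfer} -vv"
}

# Print the whole workload: both IOR commands, repeated for 10 runs.
# Assumption: the two commands alternate within each run.
torture_commands() {
    run=1
    while [ "$run" -le 10 ]; do
        ior_command 64k 200000 4k
        ior_command 1M 14000 256k
        run=$(( run + 1 ))
    done
}

# Dry run: list the commands instead of executing them.
torture_commands
```

At roughly 77 and 87 minutes per invocation, based on the two runs shown above, listing the commands before starting is a cheap sanity check.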
With the explanation of the benchmarks behind us, let's move on to the meat of the article - the results.