Like two fencers in a dark room separated by 50 feet, both users and vendors will insist that they are stabbing in the right direction. Are you asking for vendor benchmarks for configurations that match your applications? Are vendors testing storage solutions with tests that approximate your applications on configurations that you will purchase?
Know Your IO Patterns
In my previous article, I talked about a new benchmarking reality where the benchmarks should include the performance during a drive rebuild and report the amount of time for the drive rebuild to finish while under load. I made a comment in that article, "... As a side note it is quite common for customers to request a configuration that doesn't match their IO patterns and workloads just because of either folklore, urban myths, or just because it's easy, rather than actually characterize the IO patterns of their applications. ...". I want to talk about this statement in this article because of a very, very important concept that people miss—knowing the IO pattern of your application(s).
Let's do a simple experiment. Write down the top three applications on your system. These can be the ones that use the most CPU time or the ones that are run the most often or the ones that use the most data or the ones that seem to be most IO intensive or even the ones that seem to run the slowest. Think about how these applications do IO and write down what you believe is the IO pattern.
It's definitely not easy, is it? Believe it or not, even application developers have a difficult time telling you how their applications do IO.
All developers can tell you that at certain points during the execution of the application IO will be performed. And they can sometimes tell you which language IO functions are used (e.g. fwrite(), fread(), write(), read() in the case of C and C++) but that is about it. They are usually focused on the algorithm itself and not so much on how the data gets into or out of storage.
Of course, I don't really blame them because the algorithms are difficult enough without having to focus on the IO pattern of their application. However, what this means is that it is almost impossible to design a storage system that meets the IO needs of the application. It's as if you go to a shoe store to buy size 10 tennis shoes and walk out with flip-flops that are size 15 and a pair of yellow socks. This will become even more acute as the data growth accelerates.
If you are able to describe and demonstrate the IO pattern of your application(s) then you are one of the few that I know that can do that. Pat yourself on the back, publish the results and process(es) you used, and please help others do this. But for the remaining 99.9 percent of us in the world, describing IO patterns can be very difficult. We have to start somewhere, so let's start by using some of the typical metrics that describe IO patterns.
Start at the Starting Line
The most fundamental question you can answer about the IO pattern of your application is "Does IO take up a significant portion of the run time?" In other words, "Is IO important?"
Believe it or not, this question is also not easy to answer precisely. You have to be able to measure the amount of time spent doing IO without unduly impacting the overall run time of the application. But there is a simple way of doing this for many applications—strace.
Strace is the system call tracing tool on Linux, and other *nix operating systems have comparable tools. It traces system calls and can generate quite a bit of information, such as the completion status (did it complete?), the parameters of the system call, the elapsed time to complete the system call, and in the case of reads and writes, the return value (how much data was actually written or read?). With this information and a little work you can examine the IO pattern of applications.
Virtually all IO is done via system function calls, so strace should be able to capture quite a bit of IO information.
One word of caution—if the application is doing mmap IO where it doesn't use system IO functions, then strace won't be helpful. But if you are using mmap IO then you have other issues so understanding the IO pattern may not be a high priority.
The strace information provides some insight into the IO requirements from an application's perspective. It gives you the system function calls, including the IO ones, that the application makes to the operating system. In other words, it shows the IO requests that the application is making to the system. There are several layers that data has to traverse to actually get to the storage media, but that happens within the operating system and is not a function of the application.
Below is a snippet of some strace output from a simple example that writes some data structures.
1373231279.242784 write(3, "\1\0\0\0\2\0\0\0\3\0\0\0\0\0 A\2\0\0\0\3\0\0\0\4\0\0\0\0\0\240A"..., 4096) = 4096 <0.000044>
1373231279.242921 write(3, "\1\1\0\0\2\1\0\0\3\1\0\0\0\240 E\2\1\0\0\3\1\0\0\4\1\0\0\0@!E"..., 4096) = 4096 <0.000034>
1373231279.243064 write(3, "\1\2\0\0\2\2\0\0\3\2\0\0\0P\240E\2\2\0\0\3\2\0\0\4\2\0\0\0\240\240E"..., 4096) = 4096 <0.000034>
1373231279.243188 write(3, "\1\3\0\0\2\3\0\0\3\3\0\0\0P\360E\2\3\0\0\3\3\0\0\4\3\0\0\0\240\360E"..., 3712) = 3712 <0.000034>
1373231279.243283 close(3) = 0 <0.000013>
For this example I used the "-T -ttt" options with strace to get a timestamp on each line ("-ttt") and the execution time of each system function ("-T", the last number in the <>).
In the above strace snippet, the first number on each line is the start time of the function in seconds since the epoch. The number just before the ")" is the amount of data requested to be written, and the number after the "=" is the amount of data actually written.
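The field layout just described can be pulled apart with a few lines of Python. This is only a minimal sketch that assumes the exact "-T -ttt" line format shown in the snippet. The names `Syscall`, `LINE_RE`, and `parse_line` are made up for illustration, and times are kept as integer microseconds because a floating-point epoch timestamp cannot reliably hold microsecond precision.

```python
import re
from typing import NamedTuple, Optional

# Lines copied from the strace snippet above (captured with "-T -ttt").
SAMPLE = r'''
1373231279.242784 write(3, "\1\0\0\0\2\0\0\0\3\0\0\0\0\0 A\2\0\0\0\3\0\0\0\4\0\0\0\0\0\240A"..., 4096) = 4096 <0.000044>
1373231279.242921 write(3, "\1\1\0\0\2\1\0\0\3\1\0\0\0\240 E\2\1\0\0\3\1\0\0\4\1\0\0\0@!E"..., 4096) = 4096 <0.000034>
1373231279.243064 write(3, "\1\2\0\0\2\2\0\0\3\2\0\0\0P\240E\2\2\0\0\3\2\0\0\4\2\0\0\0\240\240E"..., 4096) = 4096 <0.000034>
1373231279.243188 write(3, "\1\3\0\0\2\3\0\0\3\3\0\0\0P\360E\2\3\0\0\3\3\0\0\4\3\0\0\0\240\360E"..., 3712) = 3712 <0.000034>
1373231279.243283 close(3) = 0 <0.000013>
'''

class Syscall(NamedTuple):
    start_us: int             # start time, microseconds since the epoch
    name: str                 # system call name, e.g. "write"
    requested: Optional[int]  # bytes requested (read/write only)
    returned: int             # value after "=" (bytes moved for read/write)
    elapsed_us: int           # time spent in the call, microseconds

LINE_RE = re.compile(
    r"^(\d+)\.(\d{6})\s+"     # -ttt timestamp: seconds.microseconds
    r"(\w+)\((.*)\)\s+=\s+"   # call name and its argument list
    r"(-?\d+).*?"             # return value (plus any error text)
    r"<(\d+)\.(\d{6})>\s*$"   # -T elapsed time
)

def parse_line(line):
    """Return a Syscall for one strace line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    sec, usec, name, args, ret, esec, eusec = m.groups()
    requested = None
    if name in ("read", "write"):
        # For read() and write() the last argument is the requested byte count.
        requested = int(args.rsplit(",", 1)[-1])
    return Syscall(
        start_us=int(sec) * 1_000_000 + int(usec),
        name=name,
        requested=requested,
        returned=int(ret),
        elapsed_us=int(esec) * 1_000_000 + int(eusec),
    )

records = []
for line in SAMPLE.splitlines():
    record = parse_line(line)
    if record is not None:
        records.append(record)

for r in records:
    print(r.name, r.requested, r.returned, r.elapsed_us)
```

Real traces contain unfinished or resumed calls, signals, and other line shapes that this pattern simply skips, so treat it as a starting point rather than a complete parser.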
For the above snippet, 4 KiB was sent to the operating system for each of the first three writes and 3,712 bytes in the fourth write. This is the amount of data sent to the write() system function, which then sends the data down into the operating system and ultimately to the storage media.
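Counting request sizes like this by hand does not scale to a real trace, so here is a rough sketch of how the tally could be automated. It assumes the same strace line format as the snippet. The data argument of each write is shortened to `""...` purely for readability, and the names `TRACE` and `WRITE_RE` are illustrative.

```python
import re
from collections import Counter

# The calls from the snippet above, with the data argument shortened.
TRACE = r'''
1373231279.242784 write(3, ""..., 4096) = 4096 <0.000044>
1373231279.242921 write(3, ""..., 4096) = 4096 <0.000034>
1373231279.243064 write(3, ""..., 4096) = 4096 <0.000034>
1373231279.243188 write(3, ""..., 3712) = 3712 <0.000034>
1373231279.243283 close(3) = 0 <0.000013>
'''

# Group 1 is the requested size (last argument of write),
# group 2 is the number of bytes actually written (after the "=").
WRITE_RE = re.compile(r"\bwrite\(\d+, .*, (\d+)\)\s+=\s+(\d+)")

requested_sizes = Counter()
total_written = 0
for line in TRACE.splitlines():
    m = WRITE_RE.search(line)
    if m:
        requested_sizes[int(m.group(1))] += 1
        total_written += int(m.group(2))

for size, count in sorted(requested_sizes.items(), reverse=True):
    print(f"{count} write(s) of {size} bytes")
print(f"total written: {total_written} bytes")
```

On a full trace, the same counter gives a simple distribution of write sizes, which is one of the basic ways to describe an IO pattern.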
But the OS has buffers and will try to combine (coalesce) data requests that are next to one another to improve the overall performance. Strace output cannot gather that information—it only shows the data from the system function to the system. But the important point is that strace gathers the IO patterns from the perspective of the application.
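Staying with the application's view, the earlier question of whether IO takes up a significant portion of the run time can be estimated from the same trace. The sketch below adds up the time spent inside IO system calls and compares it with the wall-clock span of the trace. The set `IO_CALLS` is an assumption about which calls count as IO, and the data arguments are again shortened to `""...` for readability.

```python
import re

# The calls from the snippet above, with the data argument shortened.
TRACE = r'''
1373231279.242784 write(3, ""..., 4096) = 4096 <0.000044>
1373231279.242921 write(3, ""..., 4096) = 4096 <0.000034>
1373231279.243064 write(3, ""..., 4096) = 4096 <0.000034>
1373231279.243188 write(3, ""..., 3712) = 3712 <0.000034>
1373231279.243283 close(3) = 0 <0.000013>
'''

# Start timestamp (-ttt), call name, and elapsed time (-T).
LINE_RE = re.compile(r"^(\d+)\.(\d{6})\s+(\w+)\(.*<(\d+)\.(\d{6})>\s*$")

# Assumption: treat these system calls as "IO" for this estimate.
IO_CALLS = {"open", "openat", "read", "write", "lseek", "fsync", "close"}

def to_us(sec, usec):
    """Combine seconds and microseconds into integer microseconds."""
    return int(sec) * 1_000_000 + int(usec)

io_us = 0
first_start = None
last_end = None
for line in TRACE.splitlines():
    m = LINE_RE.match(line)
    if m is None:
        continue
    sec, usec, name, esec, eusec = m.groups()
    start = to_us(sec, usec)
    elapsed = to_us(esec, eusec)
    if first_start is None:
        first_start = start
    end = start + elapsed
    if last_end is None or end > last_end:
        last_end = end
    if name in IO_CALLS:
        io_us += elapsed

span_us = last_end - first_start
io_share = io_us / span_us
print(f"IO time: {io_us} us of {span_us} us ({io_share:.1%})")
```

The span here covers only the five lines of the snippet, so the percentage says nothing about the whole application; run over a complete trace, the same arithmetic gives the share for the full run. Tracing itself slows the application down, so the result is an estimate rather than an exact measurement.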
The strace output above came from C code that uses the "fwrite" function. This function buffers the write data until it reaches a certain limit, in this case 4 KiB, before executing a write() system function. It is possible to use greater buffer sizes (good article here) but this requires some work on the developer's part.
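The effect of that buffering can be imitated in Python. This is an analogy, not the C code behind the trace: `io.BufferedWriter` plays the role of the stdio buffer, and `CountingRaw` is a made-up stand-in that records the size of each chunk that would otherwise go to the write() system function. The record size and count are invented for illustration, and Python's flushing rules differ in detail from those of fwrite.

```python
import io

class CountingRaw(io.RawIOBase):
    """Hypothetical raw stream that records the size of each write it receives."""

    def __init__(self):
        self.sizes = []

    def writable(self):
        return True

    def write(self, data):
        # Pretend every byte was accepted, as a successful write() would.
        self.sizes.append(len(data))
        return len(data)

raw = CountingRaw()
buffered = io.BufferedWriter(raw, buffer_size=4096)  # 4 KiB buffer

# The application makes many small writes (illustrative: 1,000 records of 16 bytes).
record = b"\x00" * 16
for _ in range(1000):
    buffered.write(record)
buffered.flush()  # push out whatever is still sitting in the buffer

print(f"{len(raw.sizes)} underlying writes for 1000 application writes")
print("sizes:", raw.sizes)
```

Passing a larger `buffer_size` reduces the number of underlying writes further, which is the same trade-off a C developer makes when enlarging the stdio buffer.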