SSD IOPS Arms Race: Does It Matter?
During the past few weeks, I have been trying to get some information about how long a kernel context switch takes. You might ask, what is a context switch? What I am trying to find out is, if you are running an application and make a system call I/O request, how long does it take to make the request if all the cores are running all the threads that are supported by the core type and are running user applications? What if all the cores are not running user applications and the kernel is running in a core?
I ask these questions because the answer has an impact on how many I/O requests a single thread can make. This is important, as there is an SSD arms race out there. Vendors are making SSDs with higher and higher IOPS. Does anyone really need an SSD that does 1 million, 2 million or more IOPS for their problem? Consider the performance problem of speeding up file system metadata operations for commands such as find or fsck. These commands do not make parallel requests, although some file systems might do a readahead. This was discussed in Jeff Layton's latest article on fsck performance. Does it make sense to have an SSD for this task that does huge numbers of IOPS, or does an SSD that supports fewer IOPS provide similar performance?
Let me preface the rest of this article with the following statement: IOPS performance does matter up to a point, if you have a good number of cores (which today's CPUs provide) and you have a number of user applications running or making I/O requests. On the other hand, if you require a single application thread to make large numbers of I/O requests, the I/O problem is the latency in the datapath, including the application to kernel time, the time in the kernel, file system and driver stack and the time to the SSD (time to the PCIe device or time to the SAS/SATA device, including the time on the wire) as well as the time in the SSDs. If you make a synchronous read I/O request and then wait for the I/O, the latency on the whole path, including the SSD, will be the limit as to how many I/O requests you can issue.
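To make the synchronous case concrete: if one thread waits for each request to finish before issuing the next, its request rate cannot exceed 1 divided by the end-to-end latency. The following is a minimal Python sketch of that relationship. The function names and the scratch file are illustrative assumptions, not from any tool mentioned here, and the measured time includes Python interpreter overhead and will probably be served from the page cache, so treat the output as a demonstration of the method rather than a measurement of any SSD.

```python
import os
import tempfile
import time


def iops_ceiling(latency_seconds: float) -> float:
    """Upper bound on requests per second for one thread that waits on each I/O."""
    return 1.0 / latency_seconds


def mean_sync_read_latency(path: str, block_size: int = 4096, count: int = 1000) -> float:
    """Average wall-clock seconds per blocking read, including Python overhead."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        blocks = max(size // block_size, 1)
        start = time.perf_counter()
        for i in range(count):
            # Each pread blocks until the data is returned, so requests never overlap.
            os.pread(fd, block_size, (i % blocks) * block_size)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return elapsed / count


if __name__ == "__main__":
    # Hypothetical 1 MiB scratch file; reads will likely hit the page cache.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4096 * 256))
    try:
        latency = mean_sync_read_latency(tmp.name)
        print(f"mean latency per read: {latency * 1e6:.1f} us")
        print(f"single-thread ceiling: {iops_ceiling(latency):,.0f} requests/s")
    finally:
        os.unlink(tmp.name)
```

For example, a full-path latency of 100 microseconds caps a single waiting thread at about 10,000 requests per second, no matter how many IOPS the device is rated for.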
For a synchronous write, you make the request and wait for the acknowledgement before issuing the next I/O request. The latency matters, as control does not return to your program until you get the acknowledgement. If the I/O is buffered in memory, that could be pretty quick, but getting all the way to the storage device takes significant time. If you are doing asynchronous I/O, the interrupt still happens, but you get control back as soon as you make the request. You do not have to wait until the application asks to synchronize the I/O request. This is what happens for aioread and aiowrite. With listio, you specify a list of I/O requests in one system call to the kernel. Additionally, you could use many threads, each doing its own I/O, to in effect emulate asynchronous I/O.
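The idea of using many threads to keep several requests in flight can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names, block size and worker count are hypothetical, and it uses ordinary blocking os.pread calls from a thread pool rather than a real asynchronous I/O interface such as aioread or listio.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence

BLOCK_SIZE = 4096  # assumed request size for the illustration


def read_block(fd: int, index: int) -> bytes:
    """One blocking read; the calling thread waits until it completes."""
    return os.pread(fd, BLOCK_SIZE, index * BLOCK_SIZE)


def read_blocks_concurrently(path: str, indices: Sequence[int], workers: int = 8) -> List[bytes]:
    """Issue blocking reads from a pool of threads so several can be in flight at once."""
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # map() returns results in the same order as `indices`.
            return list(pool.map(lambda i: read_block(fd, i), indices))
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical scratch file: block i is filled with the byte value i.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        for i in range(16):
            tmp.write(bytes([i]) * BLOCK_SIZE)
    try:
        data = read_blocks_concurrently(tmp.name, range(16))
        print(all(block == bytes([i]) * BLOCK_SIZE for i, block in enumerate(data)))
    finally:
        os.unlink(tmp.name)
```

Each thread still pays the full per-request latency. The gain, if any, comes from overlapping those waits, which is why a multi-threaded workload can make use of more device IOPS than a single waiting thread.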
So I began my quest to find out how long it took to get into and out of the kernel and to make an I/O request, both when the kernel is running in a core and when the kernel must be moved into a core. I wanted the time to be provided in clock periods for each type of CPU. Was this too much to ask? One other thing I was told by a few people is that different chips take different amounts of time to do a context switch. If a chip has more registers, it takes longer to change context to another application, because the state of the registers and the state of the thread must be saved. I found this interesting, as I had never thought about the problem that way. This is another example of Amdahl's Law, which can tell you the maximum expected performance improvement when only one aspect of system performance is improved.
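Amdahl's Law is usually written as speedup = 1 / ((1 - p) + p / s), where p is the fraction of the total time that is improved and s is the factor by which that fraction gets faster. Here is a small Python sketch of the formula. The 20 percent device share and the 10x faster device are made-up numbers chosen only to illustrate the point, not measurements.

```python
def amdahl_speedup(improved_fraction: float, improvement_factor: float) -> float:
    """Overall speedup when only `improved_fraction` of the time gets `improvement_factor` faster."""
    if not 0.0 <= improved_fraction <= 1.0:
        raise ValueError("improved_fraction must be between 0 and 1")
    if improvement_factor <= 0.0:
        raise ValueError("improvement_factor must be positive")
    return 1.0 / ((1.0 - improved_fraction) + improved_fraction / improvement_factor)


if __name__ == "__main__":
    # Assumption for illustration: the SSD accounts for 20% of each request's
    # latency and the rest is user/kernel transition and software stack time.
    print(f"{amdahl_speedup(0.2, 10.0):.2f}x")  # 10x faster device -> 1.22x overall
```

Under that assumption, even an infinitely fast device could not deliver more than a 1.25x improvement, because the remaining 80 percent of the path is unchanged.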
Then off to Google I went, and I found some wildly varying numbers. The lowest number was on the OSDev.org sites, and I found a few other interesting articles, including Linux 2.6: A Breakthrough for Embedded Systems, which does not take the I/O into account.
Tech Paste discusses clock counts for interrupts in the article Monitoring Lock Contention on Solaris, and the journal article When Poll is Better than Interrupt discusses interrupts vs. polling for Intel's new NVM Express interface.
A few other points: I am focused on Linux context switch time -- not Windows, not AIX, not Solaris, and not any other operating system. I suspect that context switch time will likely be longer in some of these, and it is unlikely to be much shorter in most of these.
The last two examples show interrupt overhead similar to that of the first two. In addition, I put word out to a number of friends of mine in the industry. One of the inputs I received was from a person I have known for more than 30 years. He develops Linux drivers for a large company, and I trust him a great deal. Using the Nehalem-EX CPU, he gave me numbers that were higher than anyone else's.
First the highest numbers:
|GHz of core||3.0|
|Time in seconds for one clock||0.0000000003|
|Number of clocks to interrupt and move to OS land||138,000|
|Time for I/O in the kernel in clocks||60,000|
|Number of clocks to interrupt and move back to user land||138,000|
|Total seconds for interrupt||0.000112|
|Number of interrupts per second||8,928.57|
|Min. I/O requests per interrupt for 1,000,000 IOPS||112.00|
Now for the lowest numbers:
|GHz of core||3.0|
|Time in seconds for one clock||0.0000000003|
|Number of clocks to interrupt and move to OS land||2,598|
|Time for I/O in the kernel in clocks||2,598|
|Number of clocks to interrupt and move back to user land||2,598|
|Total seconds for interrupt||0.0000026|
|Number of interrupts per second||384,911.47|
|Min. I/O requests per interrupt for 1,000,000 IOPS||2.60|
A few points on the numbers:
- The time for I/O in the kernel on the lowest numbers was just a guess, as the article did not discuss how long it took to do the I/O. I think that this is a very low number.
- The times for the highest numbers are based on the input from my friend.
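The arithmetic behind the two tables can be reproduced with a short Python sketch. The function and key names are illustrative. The inputs are the clock counts from the tables, and the calculation assumes, as the tables do, that a single thread completes one user-to-kernel-to-user round trip at a time.

```python
def interrupt_budget(clock_ghz: float, clocks_to_kernel: int, clocks_in_kernel: int,
                     clocks_to_user: int, target_iops: float = 1_000_000) -> dict:
    """Derive the per-round-trip time and rates from clock counts."""
    clock_hz = clock_ghz * 1e9
    total_clocks = clocks_to_kernel + clocks_in_kernel + clocks_to_user
    interrupts_per_second = clock_hz / total_clocks
    return {
        "seconds_per_interrupt": total_clocks / clock_hz,
        "interrupts_per_second": interrupts_per_second,
        # I/O requests each round trip would have to carry to reach target_iops.
        "min_requests_per_interrupt": target_iops / interrupts_per_second,
    }


if __name__ == "__main__":
    cases = {
        "highest": (3.0, 138_000, 60_000, 138_000),
        "lowest": (3.0, 2_598, 2_598, 2_598),
    }
    for label, args in cases.items():
        result = interrupt_budget(*args)
        print(f"{label}: {result['seconds_per_interrupt']:.7f} s, "
              f"{result['interrupts_per_second']:,.2f} per second, "
              f"{result['min_requests_per_interrupt']:.2f} requests per interrupt")
```

Changing the clock counts or the target IOPS shows how quickly the required number of requests per round trip grows as the software path gets slower.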
If you are doing single threaded operations, the limiting factor is going to be the time it takes to do the I/O switching between the user and the kernel. For file system operations like find and fsck I think the difference between having a 100,000 IOP SSD and a 1 million IOP SSD likely does not matter. Of course, if multiple users are issuing the find command then it likely will, but there is a limit that is based on the file system and the kernel, as you can do only so many operations, even if you have a 1 million IOP SSD. Gene Amdahl proved this oh so long ago, and we seem to forget the impact of what limits hardware performance. The question is, how many IOPS can your applications issue? If the applications are not threaded or using asynchronous I/O, then clearly there is a limit as to how fast you can get into and out of the operating system, especially if user applications are running on all of the cores. SSDs are a great thing, and I suspect that in the near future we will see changes to operating systems to allow them to work more efficiently. It has been done in the past, and as I have said time and time again, there are no new engineering problems, just new engineers solving old problems.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.