Benchmarking: Time for Vendors to Get a Clue

With the Supercomputing show over, the myriad benchmark results it produced on everything from CPU flops to memory bandwidth got me thinking about benchmarks in general: what they mean and why you should or should not care. I believe we mostly should not care. I am going to focus on I/O benchmarks, but the principle applies to pretty much all benchmarks, and it applies to far more than just high-performance computing. I have seen an arms race on the NAS front with SPEC SFS NFS IOPS. The benchmark arms race must end. Enterprises’ focus must change, and we all need to have an adult discussion about the change.

Weekly and sometimes daily announcements claiming that a given vendor has surpassed other vendors on the XYZ I/O benchmark provide little useful information. Most benchmarks tell us:

How much hardware the vendor can fit (and I really mean stuff) into the box it is trying to benchmark
How good the benchmark team is at optimizing the software (OS, file system, network stack, firmware for all hardware) and the hardware
How much time and money the company will want to spend running case after case to achieve the results it wants

You might ask how I know this, and the answer is that in 1996, I was part of a team that benchmarked some hardware, OS and file systems to achieve 1 GiB/sec. At the time that was huge. Back then, we got 60 percent of the performance needed with 50 percent of the hardware, so we were not scaling very well.

Fast forward to today, and nothing has changed. I think it is time, as mentioned, to have an adult conversation. Peak performance is not what matters; what matters is scalability. I challenge all of us to demand that benchmark results measure not just the peak performance of configuration XYZ but also the peak performance of configuration XYZ/4, where XYZ/4 has one-fourth of the hardware used in the full configuration.

The hardware should be configured such that the raw hardware bandwidth to the slowest component is one-fourth that of the larger hardware solution. Additionally, the XYZ configuration should be configured as XYZ/2, with raw hardware bandwidth at half that of the full system. Basically, to understand the scalability of the hardware and software stack, you need three data points. Any three data points will do, so long as they are spread widely enough to show how the system scales. Yes, the maximum performance does matter, as it shows what the hardware and software are capable of, but in my opinion it matters far, far less than understanding how the system scales.
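The three data points described above can be turned into a scaling figure with very little arithmetic. The following is a minimal sketch, not part of any published benchmark: the function name `scaling_efficiency` and the throughput numbers are hypothetical, and the sketch assumes that "ideal" means perfectly linear scaling from the smallest configuration measured.

```python
from typing import List, Tuple


def scaling_efficiency(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return (hardware_fraction, efficiency) for each measured configuration.

    Each input point is (hardware_fraction, measured_throughput).
    Efficiency is the measured throughput divided by the throughput that
    perfectly linear scaling from the smallest configuration would predict.
    """
    if len(points) < 2:
        raise ValueError("need at least two configurations to judge scaling")
    ordered = sorted(points)
    base_fraction, base_throughput = ordered[0]
    if base_fraction <= 0 or base_throughput <= 0:
        raise ValueError("hardware fraction and throughput must be positive")
    results = []
    for fraction, throughput in ordered:
        # Throughput expected if every added piece of hardware paid off fully.
        ideal = base_throughput * (fraction / base_fraction)
        results.append((fraction, throughput / ideal))
    return results


# Hypothetical measurements in GiB/sec for XYZ/4, XYZ/2 and XYZ.
measurements = [(0.25, 30.0), (0.5, 55.0), (1.0, 90.0)]
for fraction, efficiency in scaling_efficiency(measurements):
    print(f"{fraction:.2f} of hardware: {efficiency:.0%} of linear scaling")
```

With these made-up numbers the full configuration delivers three times the throughput of XYZ/4 from four times the hardware, which is the kind of result a single peak number hides.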

In my experience, vendors are unlikely ever to provide this kind of information. Systems generally don’t scale especially well for things like I/O, and it is costly to run more than one set of benchmarks on different hardware configurations, especially when you are trying to obtain macho numbers and continue the arms race.

About six years ago, I was asked to work on developing specifications for one of our customers to look at I/O requirements for a new HPC system. I remembered my 1996 benchmarking effort and others like it, and I thought about the problem long and hard. My task was to develop specifications for benchmarking I/O performance on a new class of high productivity computer systems under the DARPA HPCS program. What I decided was important was not the absolute performance of the system for I/O, but rather the performance of the system with real-world I/O problems and how the system scaled for those problems from very small configurations to very large configurations and everything in between. I worked with the user community to develop the HPCS Mission Partner I/O Scenarios.

A good starting point if you are interested would be to read the DARPA.HPCS.IO.Scenarios.2011.pdf, which talks about the background and motivation. I think the key area that makes this set of benchmarks different is scalability. The document states:

Scaling performance is important to all Scenarios. Scaling performance means that by adding equal amounts of hardware the I/O performance for the Scenarios scales nearly linearly. Scaling I/O performance is similar to scaling CPU performance in that Amdahl’s law can be applied to both.
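To make the Amdahl's law comparison concrete, here is a minimal sketch of the standard formula applied to I/O. It assumes the part of the workload that does not benefit from added hardware (for example, serialized metadata operations) plays the role of the serial fraction; the function name `amdahl_speedup` and the 90 percent figure are hypothetical and do not come from the Scenarios document.

```python
def amdahl_speedup(scalable_fraction: float, hardware_multiple: float) -> float:
    """Speedup predicted by Amdahl's law.

    scalable_fraction: share of the work that speeds up with added hardware (0 to 1).
    hardware_multiple: how many times more hardware is used than in the baseline.
    """
    if not 0.0 <= scalable_fraction <= 1.0:
        raise ValueError("scalable_fraction must be between 0 and 1")
    if hardware_multiple <= 0:
        raise ValueError("hardware_multiple must be positive")
    serial_fraction = 1.0 - scalable_fraction
    return 1.0 / (serial_fraction + scalable_fraction / hardware_multiple)


# Hypothetical case: 90 percent of the I/O work scales with added hardware.
for multiple in (1, 2, 4, 16):
    speedup = amdahl_speedup(0.9, multiple)
    print(f"{multiple:>2}x hardware -> {speedup:.2f}x performance")
```

Even with 90 percent of the work scaling perfectly, the formula caps the gain at 10x no matter how much hardware is added, which is why near-linear scaling is a demanding requirement.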

There are 14 I/O performance tests covering the following areas, which matter to the organizations that are going to buy large HPC machines with thousands or tens of thousands of nodes, hundreds of thousands of cores and many PB of storage:

Single stream with large data blocks operating in half-duplex mode
Single stream with large data blocks operating in full-duplex mode
Multiple streams with large data blocks operating in full-duplex mode
Extreme file creation rates
Checkpoint/restart with large I/O requests
Checkpoint/restart with small I/O requests
Checkpoint/restart large file count per directory – large I/Os
Checkpoint/restart large file count per directory – small I/Os
Walking through directory trees
Parallel walking through directory trees
Random stat() system call to files in the file system – one (1) process
Random stat() system call to files in the file system – multiple processes
Small block random I/O to multiple files
Small block random I/O to a single file
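To give a feel for what one of the metadata tests in the list above involves, here is a rough single-process sketch of a random stat() measurement. It is an illustration only and not the HPCS Scenarios code: the function name `random_stat_rate`, the file counts and the use of a local temporary directory are all assumptions, and on a local file system the result mostly reflects cached metadata rather than storage performance.

```python
import os
import random
import tempfile
import time


def random_stat_rate(file_count: int = 1000, stat_calls: int = 10000, seed: int = 0) -> float:
    """Create empty files, then time random os.stat() calls against them.

    Returns the number of stat() calls completed per second by one process.
    """
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as directory:
        paths = []
        for index in range(file_count):
            path = os.path.join(directory, f"file_{index:06d}")
            with open(path, "w"):
                pass  # create an empty file; only its metadata matters here
            paths.append(path)

        # Choose the targets up front so the timed loop contains only stat().
        targets = [rng.choice(paths) for _ in range(stat_calls)]
        start = time.perf_counter()
        for path in targets:
            os.stat(path)
        elapsed = time.perf_counter() - start
    return stat_calls / elapsed if elapsed > 0 else float("inf")


print(f"{random_stat_rate():.0f} stat() calls/sec (single process, local temp dir)")
```

Running the same measurement with one-fourth, one-half and all of the client processes and storage hardware, against the file system under evaluation, would give the three data points needed to judge how the metadata path scales.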

As you can see, the tests cover everything from streaming I/O to random I/O, use files (not the raw device) and, of course, look at the impact of metadata operations when doing real-world I/O (real-world for HPC environments, that is). Who really cares if you can stream data at ridiculous speeds if you cannot get your real work done? Let’s say there is a file system that can go fast if you have 16 out of 10,000 nodes writing to it. Is that an effective way to evaluate the performance of the system? It is, of course, an effective way to benchmark the system, but does it provide results that help in evaluating the technology for the environment?

Final Thoughts

Although 15 years ago I was part of the problem, it is now time for me to be part of the solution. Benchmarking storage systems or any system for the sake of achieving some macho performance number makes no sense as an evaluation tool for customers. It does make for press releases for the marketing department, but that does not help you or the users with application requirements. I have been working with people on developing these HPCS Scenarios for a number of years, and I am very happy that they are finally released. We can now begin the conversation about looking at the scalability of storage systems in real-world environments, rather than just measuring how fast you can read, write or do full-duplex I/O to the storage system without any realistic usage patterns or realistic usage of the communication to and from the storage system.

You might say that this is an HPC-only problem, but it is not. I have talked with a number of people during the past week since the release and have found that the problem is pervasive throughout the market, from the smallest SMB to the largest of enterprises. What counts and has always counted is scalability of the system. This is not to say that hardware and software developers cannot learn things from looking at maximum performance, and it is important for them to understand the bottlenecks. Even more importantly, however, they must understand how the bottlenecks behave as the system is scaled up to meet the requirements. This is true whether the world you live in is SMB or enterprise storage.

I have learned a great deal over the past 30 years about storage and storage requirements, and my strong hope is that people who look at the HPCS Scenarios update them to do things like collect configuration information about the system, and adapt them to other storage environments and requirements, from the NAS market to SMB file systems to enterprise database storage to, well, you name the operational environment. We must change the discussion from macho GiB/sec, TiB/sec, PiB/sec or x million IOPS (you pick the metric) to a discussion about how the bottleneck hardware was doubled and the system scaled at xx percent of linear scaling. It is hard to change an industry with decades of doing things the wrong way, but I think this is a good first step.

Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.
