This is the second in a three-part series on benchmarking. Part 1 examined each of the components that might be included in a typical benchmark. Today we’ll look at developing representations of your workload as well as the pros and cons of using your applications and real data in the benchmark as opposed to developing […]
This is the second in a three-part series on benchmarking. Part 1 examined each of the components that might be included in a typical benchmark. Today we’ll look at developing representations of your workload as well as the pros and cons of using your applications and real data in the benchmark as opposed to developing emulations of both.
The most important part in the development of a set of storage benchmarks is ensuring that the benchmarks represent your current workload and how you run that workload on the system(s). With an understanding of your current workload, you can begin to predict how the current workload relates to the future workload.
There are many ways to create workloads that represent your real work. Some are more difficult for you to create, while some are much harder for the storage vendors to run, and of course some are halfway in between (I know, sounds a bit like Goldilocks and the Three Bears). Whatever you chose, you need to completely understand the tradeoffs and ensure that your organization understands the advantages and disadvantages of the decisions to be made.
Characterization
Going into the development of a benchmark, you have some hard choices to make. The first is whether the benchmark will utilize your real applications and real data, or will an emulation of the workload(s) need to be developed.
Each of these two paths has pros and cons for both you and the vendor. You might wonder why I seem to be so concerned for the vendors. Besides once being a vendor and a benchmarker, I have come to realize that there is no free lunch when doing benchmarks. In other words, if you develop an expensive benchmark, quite often the cost of the benchmark will more than likely be rolled into the bid price of the hardware depending on the size of your procurement and how much the vendor(s) want your business.
Here’s how I see the tradeoffs:
| Characterization | Advantage | Disadvantage |
| Use actual applications code | This is the best measure of your real work if you structure the benchmark correctly. For database benchmarks, using the actual data may have security issues for your company, but using actual data will result in the most realistic benchmark | These types of benchmark generally have more setup time and are more difficult to run for the vendors, as a great deal of application tuning, file system tuning, system tuning, and RAID tuning may be required, all of which can affect final pricing |
| Develop a set of representative storage benchmarks | Far easier to run and to scale to larger workloads. Running this type of workload is also far easier for the storage vendors | Sometimes difficult to develop characterization of the workload given system tunables, file systems, volume managers, and the actual storage hardware |
Page 2: Using Your Workload
Using Your Workload
If you are going to use your actual workload in a benchmark, the first step is end-to-end hardware and software characterization. You need to document and understand:
All of this may seem obvious, but if you’re going to give a storage vendor your benchmark, the more documentation that you provide them with the more likely the results will meet your requirements and the fewer questions you will have to answer. And if you’ve gone so far as to document the above, then creating the operational procedures for things such as remote mirroring, tape backup, and other operational requirements will not be difficult.
When using your own workload in a benchmark, there are several additional areas that need to be clearly understood and documented for the storage vendors, including:
Emulating Your Workload
If you have a staff that can program in C, then writing the code to emulate your workload will not be that difficult. I believe that if you have done a good job with the emulation, then you’ll have a great deal more control of the benchmark in terms of scaling, and you’ll have a far better understanding of what your workload does to the actual hardware and software.
It also allows you to test the storage vendors’ hardware without the file system, as you can write/read directly to the raw devices. This allows a better understanding of the hardware that might otherwise be masked by the file system’s effect on I/O performance.
The steps for developing an emulation are relatively simple:
![]()
![]()
This seems fairly difficult and can be, but once you have completed the process, you’ll be able to easily scale your workload up and down. Another advantage is that when the vendor receives the benchmark information, you will be receiving from them a true benchmarking of the actual storage hardware, not the file system and volume manager tunables.
Page 3: What About Software?
What About Software?
Even if you’re going to benchmark a file system or shared file system, much of what was recommended for the analysis of the hardware should be done for the software. One big difference is obviously you cannot write to the raw device if you are testing a file system.
Benchmarking a file system is likely to be the most difficult benchmarking task because there are so many variables, and doing it correctly is very time consuming, both for you and for the vendors.
Here are some items that must be characterized as part of the process for benchmarking file systems, building upon the characterizations already done for the hardware:
For shared file systems, add:
Along with this you have the hardware topology, including HBAs, switches, TCP/IP network for metadata, and possibly tapes, as most shared file systems have an HSM (Hierarchical Storage Management) system built into them.
Developing the scripts, codes, and methodology to do this type of benchmarking is hard work, but while hard on your end, for the storage vendor it will be virtually impossible, as most have limited relationships with shared file systems vendors, limited server resources, and limited staff that know shared file systems.
Often what this type of benchmark becomes is really a benchmarking of the benchmarker, not a benchmarking of the software and hardware. The vendor who often wins is not the one with the best hardware and software, but the vendor with the best benchmarks. Therefore, it’s important to give the storage vendors as much information and guidance as possible.
The most important part of a file system benchmark that is often forgotten is creating fragmentation as part of the benchmark. Most file system benchmarks create a new file system and run the benchmark tests with tools such as iozone and bonnie. Most of the time this is not really valid given that on a real file system users’ files are created and removed many times, and multiple files are often written at the same time.
Some of the areas to look at are:
Each of these issues will have an impact on the benchmark that you create, regardless of which tool you use.
Conclusions
The process of developing a benchmark that mimics your operational environment — those are the key words. The process of determining what the characteristics of your environment is the first step. Workload characterization, though seemingly a difficult process, is not really that difficult when separating it into the various parts of application I/O, file system configuration, system and file system tunables, and hardware configuration requirements. It’s also important to keep in mind how the new system will be used as compared to how the old system was used.
Next time we will review the process of packaging, rules, analysis, and scoring.
Henry Newman has been a contributor to TechnologyAdvice websites for more than 20 years. His career in high-performance computing, storage and security dates to the early 1980s, when Cray was the name of a supercomputing company rather than an entry in Urban Dictionary. After nearly four decades of architecting IT systems, he recently retired as CTO of a storage company’s Federal group, but he rather quickly lost a bet that he wouldn't be able to stay retired by taking a consulting gig in his first month of retirement.
Enterprise Storage Forum offers practical information on data storage and protection from several different perspectives: hardware, software, on-premises services and cloud services. It also includes storage security and deep looks into various storage technologies, including object storage and modern parallel file systems. ESF is an ideal website for enterprise storage admins, CTOs and storage architects to reference in order to stay informed about the latest products, services and trends in the storage industry.
Property of TechnologyAdvice. © 2025 TechnologyAdvice. All Rights Reserved
Advertiser Disclosure: Some of the products that appear on this site are from companies from which TechnologyAdvice receives compensation. This compensation may impact how and where products appear on this site including, for example, the order in which they appear. TechnologyAdvice does not include all companies or all types of products available in the marketplace.