Modeling computer systems has been used in the development of large systems for many years, and is a requirement in some environments as part of the architecture design process. This month we will look at some of the processes and reasons for modeling and simulation, as well as provide a bit of history on modeling and simulating computer systems.
The commercial application of modeling and simulating computer systems has been around for over 30 years. It was originally developed for mainframes when BGS (I believe an MIT spin-off) created a modeling package for IBM mainframes. In many ways, mainframes were and still are far simpler to model than current Unix systems for a number of reasons:
- They have far more deterministic queues of work
- They provided (even back then) far more information on the queues of work in the operating system
- Companies were paying large amounts of money and needed to plan for the future, and modeling was (and is) in IBM's best interests in most cases, as it allowed them to sell more hardware
- Service level agreements required accurate predictions of performance
Modeling, simulation, and capacity planning became a requirement for mainframe systems. This process was often combined with system tuning. Smaller mainframe companies such as Univac developed their own in-house modeling, capacity planning, and tuning groups, as their market share did not justify independent companies developing products.
Fast Forward to the Early 1990s
In the early 1990s, a group that started under Unix System Laboratories made an effort to develop mainframe-like statistics for Unix. The group was called the Performance Management Working Group (PMWG) and was headed by Shane McCarron. This group then moved under X/Open and actually published a standard, but since it wasn't adopted by any hardware vendors, the work could technically be considered a failure.
At the same time, a number of software vendors were writing data collectors to gather performance statistics (any and all) from various vendors' operating systems. Combining this with statistics from Oracle and early web servers, some vendors actually wrote drivers to track everything done in the kernel. In my experience, these drivers provided much-needed statistics, but at an extremely high overhead given their single-threaded implementation.
The problem with modeling Unix systems is that they did not have bell-shaped distributions like mainframes did. Sometimes the distribution was multi-modal, which presented a problem for the standard modeling techniques used on mainframes based on queuing theory. Here is some good reading if you're interested:
http://www.cs.uml.edu/~giam/Mikkeli/ (Lectures 1-10)
This was especially true for networks, so some of the early products developed were for networking only, such as COMNET (which is now owned by Compuware). The methodology most often chosen for modeling by these vendors was discrete event simulation. This method allows the representation of events and their interactions within the system, and it does not depend on a normal distribution of these events. Discrete event simulation has been used to model everything from production lines at fast food restaurants to computer chips, including RAID controller hardware and software.
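To make the idea concrete, here is a minimal sketch of a discrete event simulation of a single FIFO server, written in Python. It is not taken from any of the products mentioned; the function names, the exponential interarrival times, and the two-peaked service time distribution are all assumptions chosen for illustration. The point is that the simulation only needs a way to draw the next event time, so it works the same whether or not the distribution is bell-shaped.

```python
import heapq
import random
from collections import deque


def simulate_single_server(num_jobs, mean_interarrival, service_time_fn, seed=1):
    """Simulate one FIFO server and return the mean response time per job.

    service_time_fn(rng) is a hypothetical hook that returns one service
    time; it can follow any distribution.
    """
    rng = random.Random(seed)
    events = []  # heap of (time, sequence, kind, job_id)
    seq = 0      # tie-breaker so the heap never compares event kinds
    t = 0.0
    for job_id in range(num_jobs):
        # Assumption: arrivals are exponentially spaced.
        t += rng.expovariate(1.0 / mean_interarrival)
        heapq.heappush(events, (t, seq, "arrival", job_id))
        seq += 1

    waiting = deque()   # jobs queued for the server
    arrival_time = {}
    server_busy = False
    total_response = 0.0
    completed = 0

    while events:
        now, _, kind, job_id = heapq.heappop(events)
        if kind == "arrival":
            arrival_time[job_id] = now
            waiting.append(job_id)
        else:  # departure
            server_busy = False
            total_response += now - arrival_time.pop(job_id)
            completed += 1
        # Start the next job whenever the server is free.
        if not server_busy and waiting:
            next_job = waiting.popleft()
            server_busy = True
            finish = now + service_time_fn(rng)
            heapq.heappush(events, (finish, seq, "departure", next_job))
            seq += 1

    return total_response / completed


def bimodal_service(rng):
    """Illustrative two-peaked service time: mostly short, occasionally long."""
    return 0.2 if rng.random() < 0.9 else 5.0


bimodal_mean = simulate_single_server(5000, 1.0, bimodal_service)
print(f"mean response time with bimodal service: {bimodal_mean:.2f}")
```

Swapping `bimodal_service` for any other function changes the workload without touching the event loop, which is a large part of why this approach suited systems whose behavior did not fit the usual analytic assumptions.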
I am aware of two types of modeling conducted today for computer systems:
- Modeling a current system to plan for the future and/or try out "what if" scenarios
- Modeling a system that has not yet been built to determine if the proposed architecture will meet the expected requirements
Modeling Current Systems
Not long ago, we were asked to model the batch queuing system at a site. The site was running many jobs through a system with several hundred processors, with some of the jobs requiring 256 processors. In addition, the customer had many queues to meet its demands and used different scheduling algorithms for day and night. The customer had a number of goals for this modeling effort, including:
- Determine the optimal queue configuration to meet workload scheduling goals and objectives
- Determine the best scheduling algorithm for day and night based on workload goals and objectives
- Understand the job arrival rates over long and short periods of time
- "What if" planning for changing:
  - The computer system
  - Job workloads (number of processors requested)
  - Workload schedule goals and objectives
A model was developed based on the job queuing system, the queue structure, the queuing system scheduling algorithm, the job resource requirements, and the job arrival and departure information. All of this was parameterized using a discrete event simulation tool.
After some training, the customer was able to operate the model to fully understand the implications of the resource requirements, the scheduling, the workload, and queuing system. They used this information to tune the scheduler to improve both system utilization and response time.
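As a rough illustration of what such a model involves, the sketch below simulates a batch queue in Python. It is not the model we built for the customer: the strict first-come, first-served policy, the function name, and the three sample jobs are all assumptions made to keep the example small. It takes job arrival times, processor requests, and run times, and reports the mean queue wait and the processor utilization.

```python
import heapq
from collections import deque


def simulate_batch_queue(jobs, total_procs):
    """Simulate a strict FIFO batch queue on a fixed pool of processors.

    jobs: iterable of (arrival_time, procs_requested, runtime).
    Returns (mean_wait, utilization). Assumption: the job at the head of
    the queue blocks everything behind it until it fits.
    """
    jobs = sorted(jobs)
    if any(procs > total_procs for _, procs, _ in jobs):
        raise ValueError("a job requests more processors than the system has")

    pending = deque(jobs)   # jobs that have not arrived yet
    queue = deque()         # jobs waiting to start
    running = []            # heap of (finish_time, procs)
    free = total_procs
    total_wait = 0.0
    busy_proc_time = 0.0
    makespan = 0.0

    while pending or running:
        next_arrival = pending[0][0] if pending else float("inf")
        next_finish = running[0][0] if running else float("inf")
        if next_finish <= next_arrival:
            # A job completes and returns its processors.
            now, procs = heapq.heappop(running)
            free += procs
        else:
            # A job arrives and joins the queue.
            arrival, procs, runtime = pending.popleft()
            now = arrival
            queue.append((arrival, procs, runtime))
        # Start queued jobs in order for as long as the head job fits.
        while queue and queue[0][1] <= free:
            arrival, procs, runtime = queue.popleft()
            free -= procs
            total_wait += now - arrival
            busy_proc_time += procs * runtime
            heapq.heappush(running, (now + runtime, procs))
            makespan = max(makespan, now + runtime)

    mean_wait = total_wait / len(jobs)
    utilization = busy_proc_time / (total_procs * makespan)
    return mean_wait, utilization


# Hypothetical workload: (arrival_time, procs_requested, runtime)
sample_jobs = [(0, 4, 10), (1, 2, 5), (2, 2, 5)]
mean_wait, utilization = simulate_batch_queue(sample_jobs, total_procs=4)
print(f"mean wait: {mean_wait:.2f}, utilization: {utilization:.2f}")
```

"What if" questions then become changes to the inputs: a different `total_procs`, a different mix of processor requests, or a different rule in the inner loop that decides which queued job starts next.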
Modeling and Simulating Future Systems
Not long ago, as part of a response to a European customer's ITT (Invitation To Tender), which is like a U.S. Government RFP, we were asked to provide a model showing that the proposed tape architecture would meet the requirements for response time. We estimated that the system would need around 30-50 tape drives, and even our response team was unsure of exactly how many would be required. Adding to the complexity, we were bidding a tape drive that had not been announced yet, so all we had when we started were spec sheets from the vendor. The required model provided a good way for us to understand what to bid and for the customer to ensure that we based our bid on something better than a dart throw from a bunch of engineers in a room.
The first step was a visit to the hardware vendor's engineering staff for an in-depth discussion of how this tape drive worked. We needed to understand, among other details:
- Load time
- Position time
- Error recovery and performance
- Performance in general
- Data compression algorithm
- Hardware interface issues with Fibre Channel and Ultra-SCSI
- Robot pick issues
- Robot database issues
- Software access request times within the application
- Database lookup time for where the tape was in the robot
- Pick time and movement time for the robot
- Load and average position for the tape
- Transfer rate of the tape drive -- including transfer rate based on compression
- The amount of time the tape would be loaded into the drive before it was removed
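The timing items in the list above combine in a fairly direct way, which the following Python sketch tries to show. It is a back-of-the-envelope calculation, not the model we delivered: every number is a made-up placeholder, the class and function names are hypothetical, and it assumes a drive is occupied from tape load until the tape is removed. A real model would also have to account for queuing, error recovery, and robot contention.

```python
import math
from dataclasses import dataclass


@dataclass
class TapeParams:
    """All values are placeholder assumptions, in seconds unless noted."""
    software_request_s: float   # request time within the application
    db_lookup_s: float          # finding where the tape is in the robot
    robot_pick_move_s: float    # robot pick time plus movement time
    load_s: float               # tape load time
    avg_position_s: float       # average time to position on the tape
    native_rate_mb_s: float     # uncompressed transfer rate, MB/s
    compression_ratio: float    # effective rate multiplier from compression
    idle_hold_s: float          # time the tape stays loaded before removal


def transfer_time(p, file_mb):
    """Time to move the data at the compression-adjusted rate."""
    return file_mb / (p.native_rate_mb_s * p.compression_ratio)


def response_time(p, file_mb):
    """Time from the application's request to the end of the transfer."""
    return (p.software_request_s + p.db_lookup_s + p.robot_pick_move_s
            + p.load_s + p.avg_position_s + transfer_time(p, file_mb))


def drive_occupancy(p, file_mb):
    """Time one request keeps a drive busy (assumed: load through removal)."""
    return p.load_s + p.avg_position_s + transfer_time(p, file_mb) + p.idle_hold_s


def drives_needed(p, file_mb, requests_per_hour, target_utilization):
    """Drives required to carry the offered load at a target utilization."""
    offered_load = requests_per_hour * drive_occupancy(p, file_mb) / 3600.0
    return math.ceil(offered_load / target_utilization)


# Placeholder values only; none of these come from a real drive or site.
params = TapeParams(software_request_s=0.5, db_lookup_s=0.2,
                    robot_pick_move_s=10.0, load_s=20.0, avg_position_s=40.0,
                    native_rate_mb_s=10.0, compression_ratio=2.0,
                    idle_hold_s=60.0)

print(f"response time: {response_time(params, file_mb=1000):.1f} s")
print("drives needed:", drives_needed(params, file_mb=1000,
                                      requests_per_hour=360,
                                      target_utilization=0.5))
```

Even a calculation this crude shows how sensitive the drive count is to position time, compression, and how long tapes stay loaded, which is why each of those details had to come from the vendor's engineers rather than a guess.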
The site was a weather forecast center that had done atmospheric modeling for many years, so not surprisingly, they clearly understood the scientific process of modeling and its value for predicting the future.
So Why Doesn't Everyone Model Their Systems?
I get asked this question all of the time. If modeling is such a good thing, why doesn't everyone develop models of their systems? The biggest issue is that modeling is a very expensive process. To create a model, you likely need a staff of people who understand modeling, the hardware, the software, and how to calibrate models. Having these types of people on staff is expensive, and unless you are constantly developing and implementing new architectures of great complexity, having this staff in-house is just not done very often today.
Modeling is not used for most system tuning for similar reasons, as management and the accounting department feel (and often rightfully so) that buying hardware is cheaper than modeling systems. I have been involved in modeling projects that ran into multiple person-years of effort and cost over $750K. If the system only costs a few million, you will have a hard time convincing management that modeling is worth the cost and effort. On the other hand, if the system has a defined level of service and is very complex, modeling might be the only way to ensure that the requirements are met. Additionally, if the system cost is $10 million or greater, it is a perfect candidate for modeling, and in every case I am aware of, the modeling effort has more than paid for itself in:
- Reduced hardware costs - Both initially, as you only buy the hardware you need, and over the lifetime of the system, as you now have a capacity planning tool that matches your system
- Improved response time - A good model will predict the performance of the system and allow you to meet current, future, and unexpected requirements
- Upgrades - Will the CPU, memory, and/or storage upgrade be worth the cost and still meet the requirements? (Answering this will require a model update.)
- Tuning resources - I have yet to come across a large system that hasn't required some level of tuning effort. By first modeling the system, the tuning effort is dramatically reduced, resulting in cost savings (or cost avoidance in "management speak")
Next month we will cover the details of the modeling process, including an introduction to some of the software packages and the process of creating a model. We'll also look at the different skill sets needed for successful modeling, including the ability to understand how to calibrate models and how to collect the data to produce models. See you next month.