In open storage environments, it’s quite common for a person or group to be called upon to resolve an issue that involves hardware and/or software from multiple vendors. These problems are usually complex and need to be resolved quickly and efficiently, although some turn out to be quite simple, if not always obvious, to resolve.
What I have seen in many of these cases is that the survival instincts of the vendors in question kick in, with their initial reaction being to protect their turf. When this happens, the vendor can sometimes be a bit too quick to point a finger at another vendor (or vendors) — or even at someone or some group in your company.
As you might imagine, this typically isn’t all that conducive to efficiently resolving the issue at hand. And as someone with extensive experience in having fingers pointed at him, pointing fingers at others, and resolving finger-pointing issues for customers, I see this as an issue that looms larger each day, especially as storage systems become more and more heterogeneous in nature.
So with that said, let’s take a look at how, when, and at whom a finger should be pointed, as well as how to reduce and mitigate finger pointing.
Here’s a short list of players that might be involved in a storage finger-pointing situation:
- The Integrator
- The various hardware vendors, including:
  - Remote Connection
- The various software vendors, including:
  - Operating System
  - File Systems
- Someone or some group in your own company, who:
  - May have a bad cable
  - May have changed a software setting
  - May have changed a hardware setting or configuration
Resolving finger-pointing often boils down to an exercise in good detective work, requiring a bit of solid investigative work to discern where responsibility for the issue ultimately lies.
Where to Begin
Someone who is good at finding the underlying cause of a finger-pointing exercise would make a very good detective. It needs to be someone who trusts no one without verifying the answer. As I have said before in my columns, and as Ronald Reagan put it best, “Trust but verify.”
As I have just completed a real-world exercise in resolving a finger-pointing situation, let me take you through the steps I went through with a customer. (The customer’s name, associated vendors, and some of the facts have been changed to protect the innocent.)
To start, a customer called and said they had a performance problem with a new database. They re-create the critical database every night for an internal group. They needed help in resolving the problem given that:
- The server is from vendor R
- The RAID is from vendor T
- The FC switch is from vendor Z
- The database was written by a contractor, and the client isn’t sure it’s written efficiently
- The HBA is OEMed by vendor R, but is from vendor A
- The RAID is remotely mirrored tens of miles away using a dark fibre connection, but the mirror is no longer synchronous, so that should not be a problem
The client’s pleading request was along the lines of “Can you help us like right now — I am sending the purchase order.” In other words, the clock was ticking.
The first thing I asked the client was how long this issue had been going on. The response: months, but now it is critical, as management has given it visibility.
My next question was what performance tools the client had. The client had just obtained a tool from company C and had planned to install it the following week. I asked them to install it that day and to get me the GUI for the tool.
I told the client I would need to look at the performance data for a week or two and then come on-site. I figured I wouldn’t be able to resolve the problem remotely, and for a problem this complex it made sense to collect a significant amount of performance data from each of the database update runs first. Plus I wanted to get a good look at the end-to-end configuration and storage topology.
For a week I analyzed performance data and found that the RAID system was clearly SLOOOW, so I called the customer and asked a few questions about the RAID configuration, HBAs, and the switch, as they could not give me access to these remotely. Lo and behold, I found out that more than 15 systems were connected to this RAID, but I was receiving performance data from only one of them.
So this made me question whether another system with another application was causing the problem and also whether the customer had the tool from vendor C on all the systems. I surmised the answers in this case would be not likely and no.
The On-Site Visit
Next came the on-site visit. I arrived at the site and started to look at the weekend run of the performance data. Clearly something had changed. It seemed that someone in the storage group — the group that controlled the RAIDs — had decided to change the switch between some of the machines and the RAID, allowing a less contentious connection between the database machine in question and the RAID. Performance had slightly increased as a result.
I asked a few questions about the database and someone piped up that the performance problem was nowhere near as bad before they started the remote mirror. Ah, finally a big clue. This led me to ask whether the performance before the remote mirror was acceptable. An answer of yes prompted a follow-up question of whether this was a mirror problem. No way, the client reported, saying they had 2 Gbps channels, mirrored asynchronously, and only created about 60 GB of table space.
To this response I asked about indexes and whether the redo logs were on the same devices. The client confirmed that they recreated the indexes and that the redo logs were on the same device as the table space.
Large flashing lights started going off in my head! I asked to see the FC switch. I noticed the configuration was not 2 Gbit but 1 Gbit and asked why. The DWDM consultant told us that it was not working without errors at 2 Gbit, so they had decided to scale back to 1 Gbit. With more red flashing lights going off, I looked at the switch GUI and found that only 16 Fibre Channel buffer credits were allocated to the port with the DWDM connection.
Each buffer credit lets you send a single Fibre Channel frame of about 2 KB to the other side before an acknowledgement comes back. Given the length of the fibre, you need enough credits to keep 2 KB frames flowing onto the channel until the acknowledgement for the first frame returns; otherwise the sender stalls and the link sits idle. Even the speed of light has latency, and light is of course slower in fibre than in air.
As John Mashey said, “Money can buy you bandwidth, but latency is forever.” Given the latency of this connection and the distance, the client needed to have far more than 16 outstanding I/Os, and actually needed more credits than the switch vendor Z supported, although that information wouldn’t be readily forthcoming from the vendor.
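To put rough numbers on the buffer credit argument above, here is a minimal Python sketch. It is not the calculation I ran on site, and every input is an assumption: a hypothetical 50 km link (the actual distance is not given here), light travelling at roughly 200,000 km/s in fibre, about 100 MB/s of usable bandwidth for 1 Gbit Fibre Channel, and full 2 KB frames. The function names are made up for illustration.

```python
import math

# Assumption: light in glass covers roughly 200,000 km per second
# (about 5 microseconds per km), noticeably slower than in air.
FIBRE_KM_PER_SEC = 200_000.0


def round_trip_seconds(distance_km):
    """Time for a frame to reach the far end plus the acknowledgement's return."""
    return 2.0 * distance_km / FIBRE_KM_PER_SEC


def credits_needed(distance_km, link_bytes_per_sec, frame_bytes=2048):
    """Credits required so the sender never stalls waiting for the first ack."""
    frame_time = frame_bytes / link_bytes_per_sec  # time to put one frame on the wire
    # One credit for the frame being sent, plus enough to cover the round trip.
    return math.ceil(round_trip_seconds(distance_km) / frame_time) + 1


def max_throughput(distance_km, link_bytes_per_sec, credits, frame_bytes=2048):
    """Best-case bytes per second when only `credits` frames may be in flight."""
    frame_time = frame_bytes / link_bytes_per_sec
    cycle = round_trip_seconds(distance_km) + frame_time  # until the first credit returns
    return min(link_bytes_per_sec, credits * frame_bytes / cycle)


if __name__ == "__main__":
    one_gbit = 100e6  # assumed usable bytes per second at 1 Gbit FC
    print("credits needed at 50 km:", credits_needed(50, one_gbit))
    print("MB/s with 16 credits:", max_throughput(50, one_gbit, 16) / 1e6)
```

Under these assumptions, a 50 km link at 1 Gbit needs about 26 credits to stay busy, and 16 credits caps throughput at roughly 63 MB/s no matter how fast the RAID is. Smaller frames, longer distances, or a faster link all push the requirement higher.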
The problem was resolved by changing the configuration of the switch, but how did this happen in the first place, and why wasn’t it diagnosed long before my arrival?
I believe a number of factors contributed to the problem remaining a mystery, including:
- Lack of communication among the vendors, as there was no integrator
- Lack of a single system architect overseeing and responsible for end-to-end issues
- Infighting among the PS organizations, which led to them not talking to each other
I am not a database expert, nor an expert on DWDM connections, nor an expert on the RAID in question or on the remote mirroring software. I do however understand the end-to-end issues and how things should work, and I believe I know I/O better than most. I went down the wrong path on this problem a few times, but was able to back up and start over and finally arrive at the correct culprit.
The case described above was a relatively easy problem to resolve, as it basically boiled down to a configuration issue rather than a hardware problem, software bug, or programming error. And while the database program in the case mentioned remains a performance bottleneck due to the fact that it’s poorly written, the customer can now limp by until it can be rewritten.
I illustrate this case to point out that whatever the problem is, the key to resolving it is being able to look at the issue from end to end and doing so without getting bogged down in the politics and disparate communication coming from the various players. It also requires the skill set to at least know at a high level how things should work and what to look for when they do not work correctly.
It’s also imperative to ask many questions and be able to efficiently and effectively process the results. That means looking at the forensic evidence and drawing conclusions — even if they’re sometimes wrong — and being willing to back up and start over again if necessary. These are just a few of the more important qualities involved in being able to resolve a finger-pointing exercise.
Hardware and software are purchased from a single vendor far less often than they used to be. As the Chinese food menu approach of picking hardware and software from different vendors becomes more and more common, finger-pointing is increasing in parallel.
The key point to take away is that being prepared for these types of issues means having people with the right skill sets to deal with the problems when they arise, or being able to find the right people to solve the problems at hand.
Be prepared for the hard problems that sometimes take weeks to resolve. These issues often involve data corruption, so expect them to make for quite unpleasant and often messy situations. And while you can always hope they’re simple configuration issues that can be swiftly resolved, don’t count on that happening very often.