Resolving Finger-Pointing in Storage Page 2
Where to Begin
Someone who is good at finding the underlying cause of a finger-pointing exercise would make a very good detective. It needs to be someone who trusts no one without verifying the answer. As I have said before in my columns, and as Ronald Reagan put it best, “Trust but verify.”
As I have just completed a real-world exercise in resolving a finger-pointing situation, let me take you through the steps I went through with a customer. (The customer’s name, associated vendors, and some of the facts have been changed to protect the innocent.)
To start, a customer calls and says they have a performance problem with a new database. They re-create the critical database every night for an internal group. They need help in resolving the problem given that:
- The server is from vendor R
- The RAID is from vendor T
- The FC switch is from vendor Z
- The database was written by a contractor, and the client isn’t sure it’s written efficiently
- The HBA is OEMed by vendor R, but is from vendor A
- The RAID is remote mirrored tens of miles away using a dark fibre connection, but the mirror is no longer synchronous, so that should not be a problem
The client’s pleading request was along the lines of “Can you help us like right now — I am sending the purchase order.” In other words, the clock was ticking.
The first thing I asked the client was how long this issue had been going on. The response: months, but now it is critical, as management has given it visibility.
My next question was what performance tools the client had. The client had just obtained a tool from company C and had planned to install it the following week. I asked them to install it that day and to get me the GUI for the tool.
I told the client I would need to look at the performance data for a week or two and then come on-site. I figured I wouldn’t be able to resolve the problem remotely, and for a problem this complex it would make little sense to proceed without a significant amount of performance data from each of the database update runs. Plus I wanted to get a good look at the end-to-end configuration and storage topology.
For a week I analyzed performance data and found that the RAID system was clearly SLOOOW, so I called the customer and asked a few questions about the RAID configuration, HBAs, and the switch, as they could not give me access to these remotely. Lo and behold, I found out that over 15 systems were connected to this RAID, but I was receiving performance data from only one of those systems.
So this made me question whether another system with another application was causing the problem and also whether the customer had the tool from vendor C on all the systems. I surmised the answers in this case would be not likely and no.
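The coverage gap described above (many systems sharing one RAID, with data coming from only one of them) is easy to check mechanically once you have an inventory. What follows is only a rough sketch in Python, not anything the customer or vendor C's tool actually provided. The host names are made up, and the count of 15 attached systems is an assumption standing in for "over 15."

```python
def monitoring_gap(attached_hosts, monitored_hosts):
    """Return the attached hosts that report no performance data."""
    return sorted(set(attached_hosts) - set(monitored_hosts))


def coverage_fraction(attached_hosts, monitored_hosts):
    """Fraction of attached hosts that are also monitored."""
    attached = set(attached_hosts)
    if not attached:
        return 0.0
    return len(attached & set(monitored_hosts)) / len(attached)


# Hypothetical inventory: 15 hosts share the RAID (assumed count),
# and only one of them has the performance tool installed.
attached = [f"host{i:02d}" for i in range(1, 16)]
monitored = ["host01"]

unmonitored = monitoring_gap(attached, monitored)
print(f"{len(unmonitored)} of {len(attached)} hosts are unmonitored")
print(f"coverage: {coverage_fraction(attached, monitored):.0%}")
```

With numbers like these, slow RAID response times seen from the one monitored host say little about who is generating the load, which is why the unmonitored systems become the next thing to investigate.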