Resolving Finger-Pointing in Storage Page 3
The On-Site Visit
Next came the on-site visit. I arrived at the site and started to look at the weekend run of the performance data. Clearly something had changed. It seemed that someone in the storage group (the group that controlled the RAIDs) had decided to change the switch between some of the machines and the RAID, giving the database machine in question a less contended connection to the RAID. Performance had improved slightly as a result.
I asked a few questions about the database, and someone piped up that the performance problem had been nowhere near as bad before they started the remote mirror. Ah, finally a big clue. This led me to ask whether performance before the remote mirror had been acceptable. An answer of yes prompted a follow-up question: was this a mirror problem? No way, the client reported: they had 2 Gbps channels, mirrored asynchronously, and had created only about 60 GB of table space.
In response, I asked about indexes and whether the redo logs were on the same devices. The client confirmed that they had recreated the indexes and that the redo logs were on the same device as the table space.
Large flashing lights started going off in my head! I asked to see the FC switch. I noticed the port was configured for 1 Gbit rather than 2 Gbit and asked why. The DWDM consultant told us that the link had not run without errors at 2 Gbit, so they had decided to scale back to 1 Gbit. With more red flashing lights going off, I looked at the switch GUI and found that only 16 Fibre Channel buffer credits were allocated to the port with the DWDM connection.
Each buffer credit lets you queue a single Fibre Channel frame of about 2 KB to the other system. Over a long fibre run, you need enough credits to keep the entire channel filled with 2 KB frames until the acknowledgement for the first frame comes back. Even the speed of light has latency, and light is of course slower in fibre than in air.
As John Mashey said, “Money can buy you bandwidth, but latency is forever.” Given the distance and latency of this connection, the client needed far more than 16 frames outstanding, and actually needed more credits than switch vendor Z supported, although that information wouldn’t be readily forthcoming from the vendor.
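To make the buffer credit arithmetic concrete, here is a rough sketch of the calculation. The numbers are assumptions for illustration, not measurements from this site: light is taken to need about 5 microseconds per kilometre of fibre, a frame is rounded to 2 KB, 1 Gbit and 2 Gbit Fibre Channel are taken as roughly 100 MB/s and 200 MB/s, and the 100 km link length is hypothetical, since the actual distance is not given here. The model also ignores switch processing time and frame overhead, so treat the results as a lower bound on the credits needed.

```python
import math

# Assumed constants (illustrative, not taken from the client's configuration).
FIBRE_DELAY_S_PER_KM = 5e-6   # approximate one-way propagation delay in fibre
FRAME_BYTES = 2048            # one buffer credit covers one ~2 KB frame


def round_trip_s(distance_km):
    """Time for a frame to reach the far end and its acknowledgement to return."""
    return 2 * distance_km * FIBRE_DELAY_S_PER_KM


def credits_needed(distance_km, link_bytes_per_s):
    """Credits required to keep the link busy for one full round trip."""
    frame_time_s = FRAME_BYTES / link_bytes_per_s
    return math.ceil(round_trip_s(distance_km) / frame_time_s)


def throughput_ceiling(distance_km, link_bytes_per_s, credits):
    """Best-case bytes/s when at most `credits` frames can be in flight."""
    credit_limited = credits * FRAME_BYTES / round_trip_s(distance_km)
    return min(link_bytes_per_s, credit_limited)


if __name__ == "__main__":
    distance_km = 100  # hypothetical link length
    for name, rate in (("1 Gbit", 100e6), ("2 Gbit", 200e6)):
        need = credits_needed(distance_km, rate)
        cap = throughput_ceiling(distance_km, rate, 16) / 1e6
        print(f"{name}: {need} credits to fill the link; "
              f"16 credits cap throughput near {cap:.1f} MB/s")
```

Under these assumptions, 16 credits are only enough for a link of a few tens of kilometres at 1 Gbit, and doubling the link speed doubles the number of credits required, which is why buying more bandwidth alone would not have helped.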
The problem was resolved by changing the configuration of the switch, but how did this happen in the first place, and why wasn’t it diagnosed long before my arrival?
I believe a number of factors contributed to the problem remaining a mystery, including:
- Lack of communication among the vendors, as there was no integrator
- Lack of a single system architect overseeing and responsible for end-to-end issues
- Infighting among the PS organizations, which led to them not talking to each other