Debugging in the Real World
I was recently working with a customer on a problem where we were getting corrupted data and zero-length files in a heterogeneous shared file system environment with dual HBAs, dual switches, RAID, and Fibre Channel tape. The first step was to figure out what was happening, where, and when, which meant correlating the log files from every device in the data path.
This became a problem, though, because this was a new system and the NTP (Network Time Protocol) daemons had not been set up to run on the servers, IP switches, Fibre Channel switches, and RAID controllers. So before any real debugging could start, we had to match up the log times against actual time to establish what was happening and when. (Step 1a was getting the customer to ensure that NTP was running properly for future debugging.)
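The manual log-matching step above can be sketched in code. This is a minimal illustration, not anything from the actual engagement: it assumes a hypothetical per-host clock offset table (the kind you build by hand when NTP was never running) and a simple `YYYY-mm-dd HH:MM:SS message` log format, then merges everything onto one corrected timeline.

```python
from datetime import datetime, timedelta

# Hypothetical per-host clock offsets in seconds, measured by comparing
# each device's clock against one reference host. With NTP running these
# would all be near zero; without it, they must be estimated by hand.
CLOCK_OFFSETS = {
    "server1": 0.0,        # reference host
    "fc-switch1": -127.0,  # switch clock runs about two minutes slow
    "raid1": 42.5,         # RAID controller clock runs fast
}

def parse_line(host, line):
    """Parse a 'YYYY-mm-dd HH:MM:SS message' line and shift its
    timestamp onto the reference host's timeline."""
    stamp, message = line[:19], line[20:]
    t = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    t -= timedelta(seconds=CLOCK_OFFSETS[host])  # correct the skew
    return t, host, message

def merged_timeline(logs):
    """logs: dict mapping host name -> list of raw log lines.
    Returns all entries sorted on the corrected timestamps."""
    entries = [parse_line(host, line)
               for host, lines in logs.items()
               for line in lines]
    return sorted(entries)

logs = {
    "server1":    ["2018-03-01 12:00:10 write error on /dev/sdb"],
    "fc-switch1": ["2018-03-01 11:58:05 port 4 CRC errors"],
}
for t, host, msg in merged_timeline(logs):
    print(t, host, msg)
```

Once the entries interleave correctly (here the switch's CRC errors, corrected for its slow clock, actually land two seconds *after* the server's write error), patterns across devices become visible that no single log shows on its own.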
After matching the log files, we were able to determine a pattern across a number of problems and error conditions, all of which pointed to a hardware fault in the switch that occurred only when the user application was performing asynchronous I/O to the client file systems. In other words, you had to fully understand the application, shared file system, HBAs, and the rest of the entire data path to discover the source of this problem.
The promises from vendors that everything will work together all of the time, and that it will be easy to put together, have yet to be realized. If you buy everything from a single vendor, you can generally be assured (at least if the configuration has been shipping for a while) that it will work together. On the other hand, if you or your manager decides that you're going to be the integrator, you need to pay careful attention to some of the basic interoperability issues of the hardware and software components.
Driver and firmware compatibility issues continue to plague us. Most of the time everything works, but again, the key word is "most." The areas that prove the most difficult are high-availability (HA) systems with shared file systems and HBA, switch, RAID, and file system metadata failover. These configurations almost always have the largest interoperability issues, given their complexity and, from what I have seen, the lack of sufficient testing by the vendors. In the vendors' defense, it takes a huge amount of money to maintain an interoperability lab just from the hardware and software perspective, and even more money for smart people to run the lab.
Testing HA interoperability with shared file systems is very hard. Who is supposed to do the testing? The HBA vendor, the switch vendor, the file system vendor, the tape vendor, who? You will likely get promises from the salespeople at each of these companies. I guess Ronald Reagan said it best: "Trust, but verify."