A recent Storage Networking Industry Association End User Council survey examined storage pain points (See “Storage Users Speak Out”) and came up with some not too surprising conclusions: Storage costs too much and is too hard to manage, and the problem is compounded by a fundamental lack of understanding of how to architect, provision and scale storage networking technology solutions.
Not surprisingly, SNIA was quick to swing into action, launching initiatives revolving around standards, education and practices aimed at alleviating end user pain. Mark me down as a non-believer that SNIA, T11 or any other group will successfully develop a common framework to manage, monitor, and provide the needed information for storage devices. I will be pleasantly surprised if it happens, but I think I have history on my side.
It’s easy to say that vendors need to make storage easier, but that’s not how the market works. Customers don’t spend money on management, so vendors don’t spend much time developing it. And if there’s no money in it, it’s unlikely that standards efforts will succeed.
I spent some time thinking about why storage management standards are the forgotten stepchild, just now coming into existence. Let’s go back and take a look at some standards that have worked and the motivation behind why they work, and then look at some that haven’t worked and why.
Fibre Channel: A Standard That Worked
One of the better examples I know of a standard that worked is the Fibre Channel work done by the T11 group. The T11 group has been around since 1994 in its current incarnation, and before that was the Task Group X3T9.3. At about the same time, the XOpen Performance Management Working Group was also working on storage standards, but we’ll get to them a little later.
Back in the early years of Fibre Channel there were huge interoperability problems. You had HP (now Agilent) developing Tachyon chipsets, along with some others. You had Jaycor (which became JNI and then AMCC) developing early HBAs. You had RAID vendors such as Data General (now EMC) and MaxStrat (now Sun) all working on Fibre Channel products. There were many other vendors working on the new standard, and no single vendor had an end-to-end solution from host to disk.
As you can imagine, the early days were painful, very painful. Nothing really worked with anything else. When it did work, there were still many N-cases that caused problems. Often these problems were not found by the vendors but by the early adopter customers using products in production. Many of the problems were in the error recovery area, which is not surprising given the complexity of recovery. The tools to find these errors were rudimentary, and a large number of people were spending time and money to track down these problems and fix them.
The problem in the early days was that fixes were done between two vendors, say Jaycor and Symbios (which was bought by LSI and is now Engenio), but the problem might get solved a different way by Ciprico (another early Fibre Channel vendor) or MaxStrat.
All of this settled down and all of the one-off fixes were brought back to T11, and the standard was revised to take into account all the areas that hadn’t been thought of or were done incorrectly or inefficiently. The people involved in T11 learned a great deal, as did all of us.
Within a few years, T11 had solved all of the Fibre Channel arbitrated loop problems and had started working on Fibre Channel fabric. A few more problems had to be solved when Fibre Channel tapes appeared, but they were quickly addressed. Today you can plug any HBA into any server that has a driver, plug that HBA into any switch and plug that switch into any target, whether RAID, disk or tape, and expect it all to work. Even a few years ago this was not necessarily the case all the time. There were many vendors developing Fibre Channel in those days; today there actually seem to be fewer, given all the consolidation.
With so many vendors and all the competition, and the fact that Fibre Channel was new and not widely adopted, it was in the vendors’ best interests to have a working standard to ensure customer adoption of the new standard.
I think the other reason the Fibre Channel standard was developed, completed, fixed and adopted is that it solved pressing business issues — Fibre Channel was expected to be a huge market, which turned out to be true.
A Standards Effort That Didn’t Work
At the other end of the spectrum, during those same early years of Fibre Channel, a little-known standard was being developed by an XOpen group called the PMWG (Performance Management Working Group). The PMWG work began in the early 1990s under the USL (Unix System Laboratories) banner, and the group was developing a standard for performance measurement across platforms. The idea was that everyone measured performance statistics differently and yet called them by the same names.
The group had a few members from various hardware vendors and many members from various tool companies. The tool companies wanted to have a set of common measurements to be able to predict the performance of various machines, which might save the company owning the machine from having to upgrade. As someone involved in the effort, my view is that the hardware vendors did not want the tool vendors to succeed, for two reasons: if the tools succeeded, customers might delay hardware purchases, and adding all of these common measurements to the operating system, along with the framework to extract the counters, would be time-consuming.
To the best of my knowledge no vendor ever implemented the XOpen framework, much to the chagrin of the people who worked on the standard, myself included.
So why did the group working on T11 have a standard that became so successful, which in less than eight years has become the basis of all server-based storage and high-end workstations, while the PMWG standard was not adopted by anyone? In my opinion, it is all about money.
The market for Fibre Channel and Fibre Channel products is billions of dollars per year, while the market for performance management software is small. Look at all the conferences and industry groups that revolve around Fibre Channel, while I am aware of only one conference focused on performance tools and management. Perhaps I am too cynical, but I really do think that standards and the adoption of standards are for the most part all about profit.
This should come as no surprise. Companies send their people to these standards group meetings. They pay for hotel rooms, rental cars, per diem and flights. If the standard is ratified, they have to implement it, including the cost of writing code, testing, maintenance and support. They need a return on their investment, as the stockholders demand it.
I think that we on the PMWG were naive not to understand the market forces behind what we were doing. Yes, what we were doing was important, yes, what we were doing could make system administrators’ and capacity planners’ lives much easier, and a number of other yeses could be written. But vendors needed to be able to make money on what we were doing. The vendors looked at what was being done and decided that they could not make money on the standard, and in fact that they might lose money, so why bother with all the effort?
The user community did not get behind the standard, since very few people understood the power that performance management would give them, and yet I had heard many complaints from users throughout a number of industries. Unfortunately, these were the people in the trenches, not the ones with money to spend on tools, and you had management telling them just to buy more hardware, since it is cheaper than buying software tools. The sad thing is they are often correct, even though that is not a good strategy for the long run, only in the short term.
So where does that leave us in regard to SNIA, T11 and other management standards? Experience suggests to me that neither process will be successful because the user community is not going to pay large sums of money for management. Some will pay, but will that be enough to sustain the effort? I hope I am wrong, since storage management is plain hard work and something needs to be done, but it is not solely up to standards committees to make this change. We need to put our money where our mouths are.