Solving the Storage Error Management Dilemma
And many storage devices today support SMI-S, the Storage Management Initiative Specification developed by the Storage Networking Industry Association (SNIA).
The question I have had for a long time is whether these management initiatives meet all the needs of storage administrators. The more I look at some of the strange problems I have faced and some of the stories I have heard from customers and co-workers, the more I think the answer is a resounding 'no.'
It has taken decades for network error management frameworks and error functionality in various stacks (ICMP, IP, TCP, SONET and Ethernet among them) to mature to meet requirements. SNMP 1.0 has been around since May 1991, and is defined in an RFC, the standard IETF method for publishing specifications.
So what is missing? Personally, I think there are two big elements that are missing from the error management framework for the data path:
- Detailed insight into storage devices
- Detailed information on channel error rates from each of the connections
Storage Device Error Details
Details on error information for both disk and tape drives are actually tracked. If you have a moment, you might want to take a look at this article on flash drives to get some background on SMART (Self-Monitoring, Analysis and Reporting Technology) that is used in disk drives. As for tape drives, error information is kept for both the drives and within the drive for the tape cartridge, so it is actually possible to track error conditions. The problem with both cases is it's not as easy as it first appears. Let's look at issues with both tape and disk.
All tape drives have track errors, just like any piece of hardware. In addition, all tapes have both errors and a life span. As you get closer to the end of life for a tape, you are likely to get more and more errors. These are mostly soft errors and they eventually become hard errors, which means you cannot read your data. So how do you find these errors and address soft errors before they become hard errors?
This is, of course, easier said than done. Tape error statistics are drive dependent. What you need to be able to do is send the drive a SCSI pass-through command, a low-level command that asks the drive to report the requested error information. This error information can be collected for the drive and also for the tape cartridge that is in the drive when the data is collected. Because the statistics are drive dependent, the errors and the commands to collect them on an LTO drive might be different from those on a Sun T10000 tape drive.
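As a rough illustration of what such a request involves, here is a minimal Python sketch that builds a standard SCSI LOG SENSE command block for the Write Error Counter log page and decodes a log page in the generic SCSI layout. It does not talk to a device: actually sending the command needs an OS pass-through interface, which is left out here. The response bytes are made up for the example, and vendor-specific pages (the ones that may require an NDA) will not follow these parameter codes.

```python
import struct

LOG_SENSE_OPCODE = 0x4D          # SCSI LOG SENSE(10)
WRITE_ERROR_COUNTER_PAGE = 0x02  # standard log page code
READ_ERROR_COUNTER_PAGE = 0x03   # standard log page code

# Standard parameter codes on the error counter pages
TOTAL_ERRORS_CORRECTED = 0x0003
TOTAL_UNCORRECTED_ERRORS = 0x0006


def build_log_sense_cdb(page_code, alloc_len=4096):
    """Build the 10-byte LOG SENSE CDB requesting cumulative values."""
    page_control = 0x01  # 01b = cumulative values
    return struct.pack(
        ">BBBBBHHB",
        LOG_SENSE_OPCODE,
        0x00,                                    # SP and PPC bits cleared
        (page_control << 6) | (page_code & 0x3F),
        0x00,                                    # subpage code
        0x00,                                    # reserved
        0x0000,                                  # parameter pointer
        alloc_len,                               # allocation length
        0x00,                                    # control byte
    )


def parse_log_page(data):
    """Decode a log page into (page_code, {parameter_code: value})."""
    page_code = data[0] & 0x3F
    page_len = struct.unpack_from(">H", data, 2)[0]
    end = min(4 + page_len, len(data))
    params = {}
    offset = 4
    while offset + 4 <= end:
        code, _control, length = struct.unpack_from(">HBB", data, offset)
        value = data[offset + 4:offset + 4 + length]
        params[code] = int.from_bytes(value, "big")
        offset += 4 + length
    return page_code, params


# Synthetic response for illustration only: page 0x02 with two parameters.
SAMPLE_RESPONSE = bytes.fromhex(
    "02000010"          # page code 0x02, page length 16
    "0003000400000011"  # total errors corrected = 17
    "0006000400000000"  # total uncorrected errors = 0
)

cdb = build_log_sense_cdb(WRITE_ERROR_COUNTER_PAGE)
page, counters = parse_log_page(SAMPLE_RESPONSE)
print(cdb.hex(), page, counters)
```

A rising count of corrected errors with no uncorrected errors is the soft-error trend you would want to catch before it turns into hard errors.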
This is pretty complex, and for some of the tape drives and libraries, this is not documented and you sometimes need a non-disclosure agreement to get the meaning and the location of the various errors for both the tape drives and libraries. This, naturally, is an opportunity for a software product, and a number of vendors have products that collect and display this type of data for different tape drives and robots. These products have different features and abilities and displays. Some of the products scale better than others for large environments, but you have lots of choices. These products are extremely helpful in understanding the soft errors in your environment, and they allow you to proactively address these soft errors for tapes, drives and robots before they become hard errors. Using these products is very important in large environments.
So what's wrong with this picture? Do these products integrate into the error management framework for the rest of the environment? Other than some SNMP alerts and alarms, getting the data into a single management framework is no easy task.
With disk hardware monitoring, you have a similar problem. Disks have a common set of error values that is collected and defined by SMART technology. If you have JBOD or lower-end RAID, you might be able to buy packages that will allow you to collect this SMART data.
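For JBOD, one way to get at this data is the open source smartmontools package, whose smartctl tool can emit JSON. The sketch below is only an illustration: it parses a hand-written sample shaped like that JSON output rather than querying a real drive, and both the choice of watched attributes and the "raw value above zero" rule are assumptions for the example, not vendor guidance.

```python
import json

# Attribute IDs commonly watched for media problems (an assumption for
# this sketch; vendors differ on which attributes matter).
WATCHED_IDS = {5, 197, 198}


def smart_warnings(smartctl_json):
    """Return (id, name, raw_value) for watched attributes with raw > 0."""
    report = json.loads(smartctl_json)
    table = report.get("ata_smart_attributes", {}).get("table", [])
    warnings = []
    for attr in table:
        raw = attr.get("raw", {}).get("value", 0)
        if attr.get("id") in WATCHED_IDS and raw > 0:
            warnings.append((attr["id"], attr.get("name", "unknown"), raw))
    return warnings


# Hand-written sample, trimmed to the fields this sketch reads.
SAMPLE_JSON = json.dumps({
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 8}},
            {"id": 9, "name": "Power_On_Hours", "raw": {"value": 20000}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 0}},
        ]
    }
})

warnings = smart_warnings(SAMPLE_JSON)
print(warnings)
```

On a real system the input would come from running smartctl with JSON output enabled against each drive, which is exactly the step that large RAID systems hide from you.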
What about those of us who have large RAID systems from the major vendors? All of those vendors are monitoring the SMART statistics and proactively failing drives based on information they receive from the drive vendors, statistical information they have gleaned over the years, and in some cases requirements for performance, as some vendors opt to replace drives rather than accept slower performance for retries. This is especially true for some vendors using SATA drives. All of this is well and good, but you have no insight into it, as it is all done and managed in the RAID controller and you never see any of it.
So once again, the question is what is wrong with this picture. Well, I have a number of issues and concerns.
- As Sir Francis Bacon said, knowledge is power. I want to know what is going on in the RAID controller, what decisions are being made about failing disk drives, and why.
- What do RAID vendors do based on what they have seen before in general, not just in some N cases? Over the last 10 years, I have seen a number of times, especially early in a new drive's release, where the failure rate is very high. If I had known the statistics, I could have been far more proactive with the vendor about these failures (of course, they probably don't want me to know).
- None of the error information is integrated into the environment, as all I am likely to get is some SNMP alerts, or potentially some more details if I log into the RAID controller itself.
For these reasons, I would much rather have the RAID vendors provide me with data on what they are doing under the covers so that I can make some better decisions. The problem is how do you get all of this information into an enterprise monitoring framework? The answer is: not easily.
Channel Error Rates
Fibre Channel and a number of other technologies have a channel error rate of one error in 10^12 bits, but are corrected to a much better rate with error correction codes. From what I have heard, Fibre Channel is corrected to about one error in 10^21 bits. This means that about once every 10^21 bits, an error gets by that is either not detected as an error or is mis-corrected.
That is a lot of bits and a good thing, but the question I have always had is what happens when the channel begins to degrade (see When Bits Go Bad). If the channel has an error rate of one in 10^12 bits and begins to fail, how does this affect the corrected error rate of one in 10^21, and when does the channel fail? At an error rate of one in 10^11 or one in 10^10? I cannot seem to get the answer to either, at least publicly. Whatever the number is, the corrected error rate degrades in a significantly non-linear fashion. Again, I can find nothing public in this area, but it is likely a big drop; something like 4 or 5 orders of magnitude would be my guess. This is why I want to collect this type of information and be able to correlate it down the whole data path.
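To put those figures in perspective, the short Python sketch below converts "one error per N bits" into the average time between errors on a continuously busy link. The 4 Gbit/sec line rate is an assumption chosen for illustration, and the arithmetic ignores encoding overhead and idle time. It says nothing about how the corrected rate actually behaves as a channel degrades, since that relationship is the unpublished part.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def seconds_between_errors(bits_per_error, line_rate_bps):
    """Average seconds between errors on a link that is always busy."""
    return bits_per_error / line_rate_bps


LINE_RATE_BPS = 4e9  # assumed link speed, for illustration only

# Raw channel: one error in 10^12 bits, then two degraded cases.
raw_seconds = seconds_between_errors(1e12, LINE_RATE_BPS)
degraded_11 = seconds_between_errors(1e11, LINE_RATE_BPS)
degraded_10 = seconds_between_errors(1e10, LINE_RATE_BPS)

# Corrected channel: one undetected or mis-corrected error in 10^21 bits.
corrected_years = seconds_between_errors(1e21, LINE_RATE_BPS) / SECONDS_PER_YEAR

print(f"raw 10^12:       one error every {raw_seconds:.1f} s")
print(f"degraded 10^11:  one error every {degraded_11:.1f} s")
print(f"degraded 10^10:  one error every {degraded_10:.1f} s")
print(f"corrected 10^21: one error every {corrected_years:.0f} years")
```

Under these assumptions a healthy link sees a raw error every few minutes and an uncorrected one roughly every 7,900 years. If the corrected rate fell by 5 orders of magnitude, that interval would shrink to about a month, which is why the shape of the curve matters.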
There are actually a bunch of error statistics and information available throughout the whole data path. The problem is that there is no common way to get at all of this information in a single management tool. All too often I have had to go from tool to tool to script to figure out problems and correlate one thing to another. With the increasing complexity of the storage environment, it sure would be nice to be able to associate all of the data path errors and warnings together along with the low-level data. SNMP alerts and alarms are just that: almost all of the time, they do not give you enough information about what led up to the alert or alarm. Maybe I am asking too much, but it sure would make a lot of people's lives easier.
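To make the idea of correlating along the data path concrete, here is a minimal Python sketch. Everything in it is hypothetical: the PathEvent record, its field names, the path identifier and the five-minute window are illustrative choices, not part of SNMP, SMI-S or any product. It assumes each tool's output has already been normalized into one common record, which is the hard part in practice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathEvent:
    """Hypothetical common record for an error or warning on the data path."""
    timestamp: float  # seconds since some shared epoch
    source: str       # e.g. "hba", "switch", "raid", "tape"
    path_id: str      # identifier shared by every component on one path
    message: str


def events_leading_up_to(alert, events, window_s=300.0):
    """Return same-path events from the window before the alert, oldest first."""
    start = alert.timestamp - window_s
    related = [
        e for e in events
        if e.path_id == alert.path_id and start <= e.timestamp < alert.timestamp
    ]
    return sorted(related, key=lambda e: e.timestamp)


# Made-up events for illustration.
history = [
    PathEvent(100.0, "switch", "path-A", "CRC error count rising"),
    PathEvent(250.0, "hba", "path-A", "command retried"),
    PathEvent(260.0, "hba", "path-B", "command retried"),
    PathEvent(10.0, "raid", "path-A", "drive rebuild finished"),
]
alert = PathEvent(400.0, "raid", "path-A", "I/O timeout")

for event in events_leading_up_to(alert, history):
    print(event.timestamp, event.source, event.message)
```

With records like these in one place, the question "what led up to this alarm?" becomes a simple query instead of a trip from tool to tool.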
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years of experience in high-performance computing and storage.