Too Much of a Good Thing: Managing Information Overload in Storage Management
When managing storage and other network elements, you can easily end up with far too much of a good thing. Servers, routers, switches, desktops, firewalls, intrusion detection systems -- each produces a wealth of information detailing every aspect of its own performance, as well as the performance of related network elements. The result is an overwhelming amount of data: a vast sea of unimportant alerts in device-specific logs masks the handful of vital alerts that require immediate analysis, coordination, and priority attention from administrators.
"Our admins will go in and look at the logs to see what happened before a server locked up," says Steve Luciano, Network Administrator for New Pig Corporation, an industrial safety and plant maintenance vendor headquartered in Tipton, Pennsylvania. "But it's difficult to keep on top of all the servers amongst everything else they have to do."
New Pig searched for a way to present storage and networking information from disparate sources in a useful, centralized format. That search led the company to acquire and install Event Log Management (ELM) software.
The key element to track when managing storage systems is, of course, the disk drives.
"You have to understand that disk drives are like light bulbs," says Paul Santeler, VP of Management Networking and High Availability Products Group at Hewlett-Packard. "They will fail. It is how well prepared you are when one fails that makes the difference between a well-run or poorly run data center."
To help administrators prepare for impending failures, disk drives use a system called Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.). S.M.A.R.T. monitors up to thirty different items within the drive, including seek time, head flying height, the time it takes to spin a disk up to its rated speed, and the internal temperature of the drive.
S.M.A.R.T. analyzes all these monitored elements and creates an overall health assessment for the drive based on algorithms the manufacturer establishes for that particular model. When a drive appears to be approaching the failure point, S.M.A.R.T. alerts the administrator, ideally with enough time to back up the drive and replace it. If the disk is part of a RAID array, there is an additional level of protection.
"When there is a failure coming, the S.M.A.R.T. drive passes that information to the RAID controller," says Santeler. "But RAID does its own analysis as well, monitoring hundreds or thousands of things on the drive itself to try to see as a whole what might cause failure."
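The kind of health assessment S.M.A.R.T. performs can be illustrated with a minimal Python sketch. It assumes the common convention in which each monitored attribute carries a normalized value (higher is healthier) and a manufacturer-defined threshold, and an attribute counts as failing once its value drops to or below that threshold. The attribute names, the numbers, and the `margin` used for early warnings are hypothetical, and real drives apply model-specific algorithms that this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class Attribute:
    """One monitored item, e.g. seek time or spin-up time (names are illustrative)."""
    name: str
    value: int      # normalized current value; higher means healthier
    threshold: int  # manufacturer-defined failure threshold


def assess(attributes, margin=10):
    """Return (status, failing_names, warning_names) for a drive.

    An attribute is failing when value <= threshold, and in the warning
    band when it sits within `margin` points above the threshold.
    `margin` is an assumption made for this sketch, not part of S.M.A.R.T.
    """
    failing = [a.name for a in attributes if a.value <= a.threshold]
    warning = [
        a.name
        for a in attributes
        if a.threshold < a.value <= a.threshold + margin
    ]
    if failing:
        status = "FAILING"
    elif warning:
        status = "WARNING"
    else:
        status = "OK"
    return status, failing, warning


# Hypothetical readings for a single drive.
SAMPLE_DRIVE = [
    Attribute("seek_time", value=100, threshold=30),
    Attribute("spin_up_time", value=35, threshold=30),
    Attribute("temperature", value=60, threshold=20),
]

if __name__ == "__main__":
    print(assess(SAMPLE_DRIVE))
```

With these sample readings, only the spin-up attribute falls inside the warning band, so the drive is reported as a candidate for backup and replacement before any attribute has actually crossed its failure threshold.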
But drive status is just one part of ensuring the availability and performance of storage systems. A complete picture requires an end-to-end view of the entire process as it affects end users. It is therefore wise to keep tabs on other sources of information as well, including:
FECN/BECN - FECNs (Forward Explicit Congestion Notifications) and BECNs (Backward Explicit Congestion Notifications) are Frame Relay messages that notify the receiving (FECN) or sending (BECN) device that there is congestion in the network.
SNMP - SNMP (Simple Network Management Protocol) lets administrators monitor and manage such items as CPU utilization, available disk space, temperature, up or down status of devices, connections or services, excessive errors on switches/routers, server fan failure, and bandwidth utilization.
Security Threats - These include password hacking, stealth and port scans on firewalls, application failures due to viruses, and login authentication failures recorded in firewall or other security logs.
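Pulling these sources into one prioritized view is the heart of the problem. The Python sketch below shows one simple way to do it, assuming every source can be normalized into a common event record with a severity level. The `Event` and `Severity` types, the source labels, and the sample messages are all hypothetical, and this is not a description of how ELM or any other product works internally.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2


@dataclass(frozen=True)
class Event:
    source: str         # e.g. "snmp", "frame-relay", "firewall", "smart"
    severity: Severity
    message: str


def triage(events, minimum=Severity.WARNING):
    """Drop events below `minimum` and list the most severe first.

    Python's sort is stable, so events of equal severity keep the
    order in which they arrived.
    """
    kept = [e for e in events if e.severity >= minimum]
    return sorted(kept, key=lambda e: e.severity, reverse=True)


# Hypothetical events gathered from several device-specific logs.
SAMPLE_EVENTS = [
    Event("snmp", Severity.INFO, "Bandwidth utilization at 40%"),
    Event("frame-relay", Severity.WARNING, "BECN received: congestion on the path"),
    Event("smart", Severity.CRITICAL, "Drive health below failure threshold"),
    Event("firewall", Severity.WARNING, "Port scan detected"),
    Event("snmp", Severity.CRITICAL, "Server fan failure"),
]

if __name__ == "__main__":
    for event in triage(SAMPLE_EVENTS):
        print(f"{event.severity.name:8} {event.source:12} {event.message}")
```

The routine bandwidth reading is filtered out, and the two critical events rise to the top of the list ahead of the warnings. That is the basic effect a centralized view aims for: the handful of vital alerts is no longer buried in the noise.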