Managing information can be difficult. We have operating systems and file systems that manage files and we also have applications that manage files, and yet there are currently no standards for the management of information. A number of groups are working on such standards, but they're years away from being implemented.
The open source community, via StorageTek (now Sun), has a hierarchical storage management (HSM) product called OpenSMS, but as with other HSM products, without standards that everyone can agree on, you have to manage files the old-fashioned way, which is not much fun in today's world.
So with that as background, let's take a closer look at managing large amounts of information with the limited data framework we have today. We'll take a look at information lifecycle management (ILM) from the vantage point of HSM, since metadata standards are still years away.
HSM products today can be divided into at least two categories:
- HSMs that use a native file system, such as EMC's Legato DiskXtender running on IBM servers with a JFS file system; and
- HSMs that combine a native file system with an integrated HSM such as Sun's StorEdge SAM-FS/QFS.
At issue is which HSM features and vendor you need to manage your files or information. There are four areas to consider when looking at management of large amounts of data: levels of storage; data reliability; ability to upgrade; and performance.
Levels of Storage
Almost all HSM products today support multiple copies of data, but some do not support different types of storage, or they support disk and tape differently.
Almost anyone looking at petabytes of data will have a requirement that HSMs support disk and tape. The important question is how multiple levels of disk are supported. For example, can the HSM support fast Fibre Channel RAID storage, and potentially multiple levels of slower SATA RAID storage? Add to that requirements to support multiple levels and potentially multiple copies of tape storage.
What is important about these types of requirements is how they could be used in your environment. In many environments, data has a usage pattern that might benefit from migrating it from the fastest storage available to lower-performance storage. If you are using the data from the lower-performance storage, the issue then becomes whether it should be migrated to the faster storage, and if so, when.
It is important to understand the policy issues for HSM products and how they affect your data usage patterns. You do not want data flying all over the place just because you read it one time or updated a single byte. Sometimes this is a policy within the product, and sometimes products do not have the ability to do what is needed. You need to figure out what is needed first.
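As a rough illustration of the kind of policy question raised above, here is a minimal Python sketch of a migration rule that refuses to move data on a single read. The tier names, the `FileStats` fields, and the thresholds are all hypothetical; they are not taken from any HSM product, and a real policy engine would weigh many more factors.

```python
from dataclasses import dataclass

# Hypothetical tier names, ordered from fastest to slowest.
TIERS = ["fc_raid", "sata_raid", "tape"]

@dataclass
class FileStats:
    tier: str                 # tier currently holding the primary copy
    days_since_access: int    # days since the last read or write
    reads_last_30_days: int   # how often the file was read recently

def next_tier(stats, demote_after_days=90, promote_min_reads=5):
    """Return the tier the file should live on under this toy policy."""
    idx = TIERS.index(stats.tier)
    # Promote only after repeated reads, so one read does not trigger a move.
    if stats.reads_last_30_days >= promote_min_reads and idx > 0:
        return TIERS[idx - 1]
    # Demote data that has sat idle for a long time.
    if stats.days_since_access >= demote_after_days and idx < len(TIERS) - 1:
        return TIERS[idx + 1]
    # Otherwise leave the data where it is.
    return stats.tier
```

Under these assumed thresholds, a file on SATA that was read once yesterday stays put, while one read ten times in the last month moves up to Fibre Channel storage. The point is only that the thresholds are the policy, and you want to know whether a product lets you set them.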
Data Reliability
Data can and will be corrupted. I remember long ago working on a computer whose memory supported SECDED (single error correction, double error detection). Naturally, the system I was working on was getting triple-bit errors, and none of the errors were being detected or corrected. The system went happily along corrupting data until the operating system crashed. Since we had no idea how long the problem had been in progress, many files that were created over a period of time were in question.
This type of problem can still happen with today's memory, although it is highly unlikely. Just as unlikely are NIC failures not detected by TCP, HBA failures not detected by Fibre Channel CRCs, and so on. But the more data is moved around and the longer it stays on media, the greater the likelihood of what I call a data event (corruption). The potential for the problem grows with the amount of data, so two issues come to mind:
- What are the requirements for data reliability for your environment?
- What features do the HSM products have that address your requirements?
You could always write two copies of each file on different devices using different network connections, and each time read the files back and compare them, but that's not terribly efficient, and if they don't match, which one is bad? One of the problems in this area is that standards do not yet exist for end-to-end data reliability. EMC supports a checksum byte on some of its products, but that only checks the data from the controller to the disk. This does not help much if you are reading the data back from tape and the tape passes an error, and it does not help at all if you get a triple-bit error. The T10 group is looking at standards in this area, but they are a long way off. The question becomes: how important is your data? If you have 10PB of data and read and write it thousands of times, with current error rates in the data path you are going to get unrecognized corruption, and I am only talking about hardware errors. File systems and operating systems are another matter.
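One way to answer the "which one is bad?" question at the application level is to record a digest when the file is first written and check each copy against it later. The sketch below is only an illustration under stated assumptions: SHA-256 and the function names are my choices, not a feature of any HSM product or of the T10 work, and it offers no protection if the data was already corrupted before the digest was computed.

```python
import hashlib

def digest(data: bytes) -> str:
    """Compute a SHA-256 digest to be stored alongside the file at write time."""
    return hashlib.sha256(data).hexdigest()

def good_copies(recorded_digest, copies):
    """Return the names of the copies whose contents still match the recorded digest.

    `copies` maps a hypothetical copy name (e.g. "disk", "tape") to its bytes.
    """
    return [name for name, data in copies.items()
            if digest(data) == recorded_digest]

# Usage: the tape copy has two bytes swapped, so only the disk copy verifies.
original = b"payload"
recorded = digest(original)
copies = {"disk": original, "tape": b"paylaod"}
survivors = good_copies(recorded, copies)
```

With two copies and a recorded digest, a mismatch no longer leaves you guessing: whichever copy fails the check is the bad one, and if both fail you at least know that neither can be trusted.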
Ability to Upgrade
Let's hope I'm right that in a few years we'll get standards that will allow file tagging to turn file names into information repositories so files can be managed by a policy that is stored with the file. If these new metadata standards become available, how will you move hundreds of millions if not billions of files from the current environment into this brave new world? The various standards groups (SNIA, ANSI and others) are working to make this a reality, but it will take time.
In the meantime, a few issues come to mind:
- Can the current software managing your environment be upgraded to these new standards?
- Are the vendors that are being used today going to support these new standards, and if so, when?
- Will the new standards require significantly more hardware to meet performance requirements, and is your hardware upgradeable?
All of these are good questions for looking at long-term issues in the ILM area. Take the case of a few petabytes stored on tape. If your new ILM software requires you to read all of the data in and add ILM information to each of the files on the tape, you are going to have a very busy system.
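To get a feel for how busy, a back-of-the-envelope estimate helps. The Python sketch below computes how long it would take just to read an archive back from tape. The drive count and per-drive streaming rate are assumed figures chosen for illustration, not measurements, and the estimate ignores mount and seek time as well as the cost of writing the updated files back out.

```python
def days_to_reread(total_bytes, drives, mb_per_sec_per_drive, efficiency=1.0):
    """Estimate the days needed to read `total_bytes` from tape.

    `efficiency` is a hypothetical fudge factor (0 to 1) for drives that
    do not stream at full rate the whole time.
    """
    bytes_per_sec = drives * mb_per_sec_per_drive * 1e6 * efficiency
    seconds = total_bytes / bytes_per_sec
    return seconds / 86400  # seconds in a day

# Assumed example: 3 PB (3e15 bytes) read by 10 drives at 100 MB/s each.
estimate = days_to_reread(3e15, drives=10, mb_per_sec_per_drive=100)
```

Under those assumptions the read pass alone takes about 35 days of continuous streaming, and halving the efficiency doubles that. Any upgrade path that requires touching every file on tape needs this kind of estimate before you commit to it.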
Performance
Current HSM systems often suffer significant performance degradation as the number of files grows or the amount of managed space grows beyond the expectations of the designers and developers. As file counts explode, the designers of many of these systems never conceived of one billion files in a file system, much less a trillion, nor did they consider the possibility of millions of files in a single directory. Issues such as these affect the performance of today's systems, so we can expect them to remain issues in the future.
And Now for the Hard Truth
As far as I am concerned, we do not have a workable information lifecycle management solution, and such a solution is at least five years down the road. Yes, there are vendors that provide ILM products, and some of them have been around for a few years, but choosing one means you are stuck with that vendor until the end of time. And when standards eventually appear (I hope), will that vendor support those standards, or will it continue down the path it started on? Of course, lots of standards have come and gone, and you do not want to be stuck using a standard that did not make it. I am pretty sure that this will not happen this time around given the global need, but it is something to keep in the back of your mind.
If ILM really does not exist today, then we are left with each vendor trying to manage a hierarchy of storage and files, which is a poor solution to the problem of data management, but we're not completely without options. The current UNIX file semantics have not kept pace with current requirements. The MVS mainframe environments knew this a long time ago and dealt with it. I think this is because one vendor controlled the operating system and that vendor, IBM, listened to its customers back in the 1970s. With UNIX being fragmented in the 1980s and 1990s, and the move to Linux today, the ability for change to happen quickly is limited. But while we may be left with nothing but bad options, that is still better than no options at all.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years of experience in high-performance computing and storage.