Management of data over its lifecycle is a vexing problem. The first questions that come to mind are how long should that life be, and who should decide? The lifespan of data can have different meanings for different people in the same organization, along with different reliability requirements, which could require different copies with different management policies. Add to this the constraints imposed by various operating and file systems, and you have a really big problem on your hands.
This is the first article in a two-part series on data lifecycle management. This article will attempt to define the problem, and the next article will discuss potential solutions — most of which have not yet been developed.
We have so many problems with the current framework for data management that it is hard to think of a good starting place. Here are the two biggest problem areas that most users deal with in the current UNIX, Linux and Windows environment:
- Operating and file system implementations, and
- Current archive technology.
Data lifecycle management problems can be solved to some extent by database technology or products like EMC’s Centera, but those approaches are still limited by what the underlying operating and file systems support, so vendors either have to do a whole lot of extra work to solve the problem or take some shortcuts.
What Users Want
Recently I have been involved in a number of projects involving multiple-petabyte archives. The managers of these archives have some common requirements:
- The people who own the archive want users to create the files, but want those files to eventually be owned by the archive, under different policies.
- The archive is not static, and as technology moves forward, they want to migrate the data easily to new hardware and media.
- Security of the data is important, but security needs go far beyond current UNIX and other implementations of user and group permissions.
- Data integrity is important, since loss or corruption of a single byte can make a file useless (a fixity-checking sketch follows this list).
- Policies for when all or part of the data is no longer useful need to be implemented, and in most cases these will differ for different types of users.
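To make the data-integrity requirement concrete, here is a minimal sketch of the kind of fixity checking an archive might run against a checksum manifest. The manifest format and helper names are illustrative assumptions, not part of any particular product.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so multi-gigabyte files never sit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Check each 'checksum  path' line (two-space separated) and report any drift."""
    failures = []
    with open(manifest_path) as manifest:
        for line in manifest:
            expected, path = line.rstrip("\n").split("  ", 1)
            if sha256_of(path) != expected:
                failures.append(path)
    return failures
```

A real archive would also record when each file was last verified and re-check on a schedule, since a single silent flip is exactly the failure the requirement is about.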
People who manage data are facing these problems and many others. Data management policies are not coordinated among systems, since no common framework exists. Hierarchies of storage, with different storage policies that go far beyond what current HSM technologies support, are needed, but the infrastructure that could make this a reality is missing.
Operating and File System Implementations
There are a number of problems in Windows, Linux and UNIX operating systems that prevent the full management of data.
- Both operating and file systems manage files by user, with user and group permissions.
- Nothing in the operating system or file system allows a file to carry file metadata describing the data and the policies associated with it. (Note that this is different from file system metadata, which provides information about the location, permissions and ownership of the file.)
Users and Groups
UNIX-based operating systems and Windows are limited by the concept of user and group ownership. This is especially true of UNIX, which has rigid POSIX definitions of users and groups and of how permissions work for both.
Having a file system and operating system that tie ownership of every file to a single user can be a bad thing. What if that user leaves suddenly under less than desirable terms? What if that user is working on a joint project with a number of other users? Can you have a common repository for all of the files?
There are now a number of “groupware” tools available that can manage this process, but shouldn’t basic operating and file system management be free of the constraints of archaic UNIX user and group concepts? I believe that the current design in UNIX and Windows systems does not provide the needed infrastructure.
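As a small illustration of how tightly POSIX ties a file to one numeric owner, here is a sketch that walks a project tree and flags everything still owned by a departed user's UID. The UID and path are hypothetical.

```python
import os
import stat

DEPARTED_UID = 1042                 # hypothetical UID of a user who has left
PROJECT_ROOT = "/projects/shared"   # hypothetical shared repository

def orphaned_files(root, uid):
    """Yield files still owned by a single UID -- the only 'owner' POSIX understands."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if st.st_uid == uid:
                yield path, stat.filemode(st.st_mode)

for path, mode in orphaned_files(PROJECT_ROOT, DEPARTED_UID):
    print(mode, path)
```

The usual workaround is a shared group and setgid directories, but every file still carries exactly one owner and one group, which is the limitation at issue here.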
File Metadata
At this point, you might want to brush up on the issue of file system metadata. Here are two suggestions: Choosing a File System or Volume Manager and Storage Focus: File System Fragmentation.
Current file system metadata structures provide little or no information about what is in a file; all they provide is the file's location, size, ownership and permissions. If you want to describe what is in the file, how it was written, its data structures, and the myriad other details that will matter if someone actually wants to read the file in 20 years, today you do that by creating a database that is generally separate from the file system and the files in question. Now you have two things to maintain, upgrade and migrate.
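On Linux, the closest thing today's file systems offer is extended attributes, which can hold a small amount of descriptive metadata next to the file itself. The attribute name and fields below are illustrative assumptions; xattrs are Linux-specific, size-limited, and not preserved by every backup or transfer tool, which is part of the problem being described.

```python
import json
import os

path = "results/run042.dat"   # hypothetical archive file

# File system metadata: location, size, ownership, timestamps -- nothing about content.
st = os.stat(path)
print(st.st_size, st.st_uid, st.st_mtime)

# Descriptive file metadata: what the file contains and how to read it later.
# "user." is the namespace Linux requires for unprivileged extended attributes.
description = {
    "format": "IEEE-754 float64, little-endian",
    "producer": "simulation v3.1",
    "retention": "review after 2030",
}
os.setxattr(path, "user.archive.description", json.dumps(description).encode())

# A later reader can recover the description without consulting a separate database.
print(json.loads(os.getxattr(path, "user.archive.description")))
```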
A number of major storage vendors such as EMC, IBM and StorageTek have recognized this problem and have developed products to address it. Of course, each of these systems is proprietary, and interoperability between products is nonexistent. There are no standards in this area currently, so there is little you can do if you need this kind of technology except buy what the vendors are offering.
Archiving Technology
While some might argue the point, using tape to store most large archives results in far lower O&M costs than storage on spinning disk or even emerging MAID (massive array of inactive disks) technology. I base this statement on the cost per GB of disk and tape, including compression; the cost of power and cooling; the cost of tape drives and robots; and reliability costs.
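The arithmetic behind that statement is simple, and here is a hedged sketch of the kind of cost-per-gigabyte model it rests on. Every input is a parameter to be filled in from your own vendor quotes; no specific figures are asserted.

```python
def archive_cost_per_gb(media_cost_per_gb, compression_ratio,
                        annual_power_and_cooling, hardware_cost,
                        annual_maintenance, capacity_gb, years):
    """Rough lifetime cost per usable GB of an archive tier.

    All inputs are placeholders from your own quotes; tape typically wins on
    media cost and power, disk on access performance.
    """
    usable_gb = capacity_gb * compression_ratio
    media = media_cost_per_gb * capacity_gb
    recurring = (annual_power_and_cooling + annual_maintenance) * years
    return (media + hardware_cost + recurring) / usable_gb
```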
One thing I did not mention is the cost of software. Most large archives under the control of HSM systems use a common interface to access the archive called DMAPI (the Data Management API), which was created by the Data Management Interfaces Group.
Development of DMAPI began in the early 1990s and was completed in the late 1990s. A number of forces came together that required vendors to let customers access their data regardless of which file system and HSM product were being used. Before DMAPI, all HSM vendors had proprietary interfaces that made migration to a new system difficult and painful, particularly for large archives.
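DMAPI itself is a C interface for intercepting file system events, so I will not reproduce it here. The sketch below only illustrates, in Python, the migrate-by-age policy an HSM built on such an interface typically enforces; the threshold is hypothetical and the two helpers are stand-ins, not DMAPI calls.

```python
import os
import time

MIGRATE_AFTER_DAYS = 90   # hypothetical policy: files idle this long move to tape

def copy_to_tape(path):
    """Stand-in for the HSM's real migration step (not a DMAPI call)."""
    raise NotImplementedError

def punch_hole(path):
    """Stand-in for replacing on-disk data with a stub that triggers recall on access."""
    raise NotImplementedError

def migrate_idle_files(root):
    """Walk the managed file system and migrate anything idle past the cutoff."""
    cutoff = time.time() - MIGRATE_AFTER_DAYS * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.stat(path).st_atime < cutoff:
                copy_to_tape(path)
                punch_hole(path)
```

The value of DMAPI is that this kind of policy engine can be written once against a standard event interface instead of against each vendor's proprietary hooks.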
Conclusions
There are no general solutions or standards that allow sites to manage data over long periods of time. A number of vendors have created proprietary solutions that might meet your requirements, but without standards and the underlying infrastructure in the operating and file systems, moving from one vendor to another will be very difficult. I have been involved in a number of brute force migrations. They always take far longer than planned, and cost more too.
Because storage densities continue to grow faster than storage performance, migration becomes increasingly difficult. Here’s a table comparing density and performance growth for disk and tape over the last 15 years:
Type | Density Increase Since 1990 | Performance Increase Since 1990
--- | --- | ---
Seagate disks | 600x | 29.5x
Tape | 7,500x | 64x uncompressed, 104x compressed
With at least an order of magnitude difference between density growth and performance growth over the last 15 years, even if these ratios begin to change, we are still far behind on the performance curve. The disturbing part is that it looks like the trend may even worsen over time.
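Using nothing but the ratios in the table above, the squeeze is easy to quantify: if a full device holds 600 times more data but reads only 29.5 times faster, reading it end to end takes roughly 20 times longer than it did in 1990, and for tape the gap is worse.

```python
# Ratios taken directly from the table above (uncompressed figure for tape).
density_growth = {"disk": 600, "tape": 7500}
performance_growth = {"disk": 29.5, "tape": 64}

for medium in density_growth:
    slowdown = density_growth[medium] / performance_growth[medium]
    print(f"Reading a full {medium} end to end now takes ~{slowdown:.0f}x "
          "longer than in 1990.")
```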
Next time we will look at some technologies that could provide some relief to these problems.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 24 years experience in high-performance computing and storage.