The POSIX file system interface isn't up to the task of managing today's data, resulting in costly fixes for users to solve problems like data integrity and regulatory compliance. It doesn't have to be that way.
The problem with management of files is just that: they're being managed as files, not as information. The standard POSIX information is far too basic. There are applications like Google Desktop that help you find what you're looking for, but that solves only part of the problem.
Like it or not, POSIX is the file system interface that everyone has standardized on. We all expect file systems to support the open() system call or, through the C library, the fopen() call. Neither the call nor the information that can be stored with a file has been extended in 20 years, maybe not since POSIX began. Today, we need much more information: data provenance, backup and archiving policy, user metadata, file reliability, and lots of other things.
What is happening is that data storage vendors are solving all of these problems in user space and not as part of a standard, and therefore none of it is portable. A few SNIA members are pushing XAM, but the file system integration remains unclear. Will cp or ftp transfer the XAM information?
What is needed is an industry-wide discussion and a group (or groups) taking the lead to fix the problem. Having a bunch of vendors go to SNIA meetings is not going to solve it. We need data management to be standardized: a common set of attributes that can move with a file or set of files, and a standardized framework, supported by everyone, for moving them. I have seen some of my customers do the same thing over and over again. They define user-level metadata for files and backup and archive policy, get charged a small fortune to have these requirements implemented in a database or other software framework, and then find that the result is not portable and never will be. Most of this work was in what I call the preservation archive community, which has specific requirements for the preservation of information.
What POSIX Offers Today
POSIX today records, for each file, the access time, modification time, status-change time, owning user and group, and permissions. That is about the extent of what POSIX provides, and all of it is accessible via the stat() system call. Notably, there is no creation time. Extended attributes go further, through calls such as getxattr() and setxattr() on Linux, but these interfaces vary by operating system rather than being fixed by the POSIX standard itself. They are the usual way to extend the information that is carried with a file.
The problem is that the framework exists for a file system to populate the attributes, but there is no common set of attributes that works across file systems. Take the following example. Let's say a vendor wants to support an HSM interface. There is a common framework called DMAPI that many HSMs use. The vendor might implement the DMAPI standard using extended attributes and populate them with the DMAPI information. When the file system can't open the file, it checks the extended attributes and finds that the file is under HSM control. What if you copy that file to another file system or a different operating system? If you use a standard copy command like cp, or you ftp the file, all of the information about the file held in the extended attributes will be lost.
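The loss is easy to reproduce. What follows is a minimal sketch, assuming Linux, Python 3, and a file system that accepts user extended attributes. The attribute name `user.hsm.state` is a made-up stand-in for vendor HSM data, not a real DMAPI field. The sketch tags a file, makes a plain byte copy, and then looks for the tag on the copy.

```python
import os
import shutil
import tempfile

# Hypothetical attribute name standing in for vendor HSM state; not a real DMAPI field.
ATTR = "user.hsm.state"


def xattr_survives_plain_copy(directory):
    """Return True or False depending on whether the attribute survived a
    byte copy, or None if extended attributes are unavailable here."""
    if not hasattr(os, "setxattr"):
        return None  # os.setxattr/os.getxattr are only provided on Linux
    src = os.path.join(directory, "original.dat")
    dst = os.path.join(directory, "copy.dat")
    with open(src, "wb") as f:
        f.write(b"payload")
    try:
        os.setxattr(src, ATTR, b"offline")
    except OSError:
        return None  # this file system does not accept user.* attributes
    # shutil.copyfile copies the file contents only, no metadata
    shutil.copyfile(src, dst)
    try:
        os.getxattr(dst, ATTR)
        return True
    except OSError:
        return False  # the attribute is missing on the copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(xattr_survives_plain_copy(tmp))
```

Where user attributes are supported, this should print False. Copy tools keep the attributes only when explicitly asked to (GNU cp, for instance, has --preserve=xattr), and even then only if the destination file system can hold them.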
Here are some of the variables you will find from the standard stat() call:
- Time of last access
- Block size for file system I/O
- Number of blocks allocated
- Time of the last change
- Group ID of owner
- File protection modes
- Time of the last modification
- Number of hard links
- Total size in bytes
- User ID of the owner
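As a quick illustration, the sketch below reads these fields through Python's os.stat(), which wraps the stat() call. The helper name `describe` and the dictionary keys are my own labels, chosen only for this example.

```python
import os
import tempfile
import time


def describe(path):
    """Collect the stat() fields listed above into a plain dictionary."""
    st = os.stat(path)
    return {
        "last_access": time.ctime(st.st_atime),
        "io_block_size": getattr(st, "st_blksize", None),    # Unix only
        "blocks_allocated": getattr(st, "st_blocks", None),  # Unix only
        "last_status_change": time.ctime(st.st_ctime),
        "group_id": st.st_gid,
        "mode": oct(st.st_mode),
        "last_modification": time.ctime(st.st_mtime),
        "hard_links": st.st_nlink,
        "size_bytes": st.st_size,
        "user_id": st.st_uid,
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as f:
        f.write(b"hello")
        f.flush()
        for name, value in describe(f.name).items():
            print(f"{name}: {value}")
```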
This is not enough information to manage a file over its lifespan. Whether it is house plans, a digital picture you took three years ago that you want to keep for 30 years, or even your tax records that you need to keep, this is just not enough information to allow you to track and maintain the file. Think about what happens to the hundreds of billions or trillions of files that are business records, medical records, government records, scientific data or other types of vital information. These files are generally managed through another application such as a database, but that does not make it the right way to do it.
A POSIX Proposal
The information POSIX provides about a file is far too limited and needs to be changed. There have not been any changes to this standard in a very long time, so I'll propose some. A standard POSIX file should let you populate the following types of information:
- T10 Data Integrity Field (DIF) Support: Right now, there is no standardized way for an application to populate the application part of the DIF.
- A per-file checksum such as SHA-256: This is critical for many preservation archives to check the integrity of the file and is required by many organizations for data integrity.
- Metadata about the file that both a user and application can populate, such as what version of an application created the file.
- Standardized backup and HSM interface information: Should this file be backed up? How many copies should be made by the HSM?
- Provenance: This is critical for integrity of the chain of custody of a file.
That list is just a subset of things I believe are needed as a standard part of a file that is moved from machine to machine. We all pay the cost of not having this type of information in a file, as we are paying for vendors to write user space applications to track this type of information on UNIX systems. The funny thing is that the IBM mainframe people are laughing at us, as MVS has had this type of information for years. I would implement this using extended attributes and define attribute groups for each type of attribute. This would enable these features without major system changes. Commands would have to be added to populate the attributes and move them around, but this is not that difficult.
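To make the idea of attribute groups concrete, here is a rough sketch in Python. Every group and attribute name in it (`user.integrity.`, `user.backup.`, `user.provenance.` and the keys under them) is hypothetical. No standard defines them today, which is exactly the gap. The sketch computes a SHA-256 checksum, bundles it with backup and provenance attributes, and writes them as extended attributes where the platform allows.

```python
import hashlib
import os
import tempfile

# Hypothetical attribute groups and names; no standard defines these today.
GROUP_INTEGRITY = "user.integrity."
GROUP_BACKUP = "user.backup."
GROUP_PROVENANCE = "user.provenance."


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_attributes(path, creator, hsm_copies, backup=True):
    """Return the proposed attributes as a name -> bytes mapping."""
    return {
        GROUP_INTEGRITY + "sha256": sha256_of(path).encode(),
        GROUP_BACKUP + "enabled": b"1" if backup else b"0",
        GROUP_BACKUP + "hsm_copies": str(hsm_copies).encode(),
        GROUP_PROVENANCE + "creator": creator.encode(),
    }


def store_attributes(path, attrs):
    """Write the mapping as extended attributes; return False if the
    platform or file system does not support them."""
    if not hasattr(os, "setxattr"):
        return False
    try:
        for name, value in attrs.items():
            os.setxattr(path, name, value)
    except OSError:
        return False
    return True


def is_intact(path, attrs):
    """Recompute the checksum and compare it with the recorded one."""
    return attrs[GROUP_INTEGRITY + "sha256"] == sha256_of(path).encode()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "record.dat")
        with open(path, "wb") as f:
            f.write(b"tax records")
        attrs = build_attributes(path, creator="example-app 1.0", hsm_copies=2)
        print("stored as xattrs:", store_attributes(path, attrs))
        print("intact:", is_intact(path, attrs))
```

Writing the attributes is the easy part. What is missing is agreement on the names and their meaning, so that the file system on the other end, and the tools in between, know to carry them along.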
Obstacles to Change
So why has no one done this yet? I think there are a number of reasons:
- Vendors do not want to make changes, and any changes to the POSIX standard would require both file system and operating system changes.
- Vendors are making money on user space applications. This is a poor reason, but I believe it is likely true. If something were standardized, then you couldn't sell it.
- These changes would generate more overhead, as you will have additional space for attribute data. This will increase the size and space needed for file system metadata (inode data) and will increase the time to open and read files.
I don't think any of these are good reasons for staying on the same path. Vendors make changes to systems all the time, so the first reason is not a very good one. Vendors will make money one way or another, as they need to and should, so this should not be an impediment. The extra overhead is not that great, as most files are large and the metadata space is small in comparison, and the difference in time to read 512 bytes (the size of many inodes) compared with 1024 bytes is extremely small.
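To put rough numbers on the space argument, here is a small sketch. It assumes the inode grows from 512 to 1024 bytes, as in the comparison above, so each file carries 512 extra bytes of metadata. The file sizes are illustrative, and `metadata_overhead` is a hypothetical helper written for this example.

```python
def metadata_overhead(file_size_bytes, extra_metadata_bytes=512):
    """Extra metadata expressed as a fraction of the file's data size."""
    return extra_metadata_bytes / file_size_bytes


if __name__ == "__main__":
    # Illustrative file sizes: 4 KiB, 1 MiB, and 1 GiB
    for size in (4 * 1024, 1024 ** 2, 1024 ** 3):
        print(f"{size:>12} bytes: {metadata_overhead(size):.6%}")
```

Under these assumptions the extra metadata is about 12.5 percent of a 4 KiB file but only about 0.05 percent of a 1 MiB file, so the cost is negligible wherever large files dominate.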
The final obstacle to change is that the user community is not involved in standards, so it's up to the vendor community to take this on. In my opinion, the only hope for change is with the people who are spending large sums of money on user space applications, both government and industry. They need to get involved.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 28 years of experience in high-performance computing and storage.