This month we take an in-depth look at the internal workings and implementations of file systems, the next stage in the data path as part of our series on storage.
In the last column we briefly introduced some of the internals of file system and volume managers; this month we will review and go into more detail on the important concepts as well as cover several new topics.
File System Services
A file system provides a management framework for the external storage, whereas memory (internal storage) is managed by the operating system itself. The file system allows you to manage data via a framework called files (shocking, huh?). File systems:https://o1.qnsr.com/log/p.gif?;n=203;c=204660765;s=10655;x=7936;f=201812281308090;u=j;z=TIMESTAMP;a=20400368;e=i
- Manage the available space on the devices that are under the file system's control
- Allocation maps
- Where the files reside on the storage
- Removal of files
- Support in some cases via hierarchical storage management (HSM) the placement of those files on secondary storage
- Provide access control via:
- Standard UNIX/POSIX permissions
- Access Control Lists (ACL)
- Support standard UNIX/POSIX interfaces like read/write, fread/fwrite, and aioread/aiowrite
- Support (in some cases) for feature functions like homogenous and/or heterogeneous file sharing and special I/O for databases
- File locking
- Support for NFS
- Special functions which allow the creation of a file without writing the data sequentially (from the start of the file to the end)
File System Functions
Because of data consistency issues, file systems have historically been checked after system crashes. As file systems grew and disk performance did not, given the physical limitations, the time to check the file system after a crash became longer and longer. I remember back in 1992 waiting for 11 hours to fsck(1M) a Cray file system that I personally crashed (the customer was not happy and to this day still remembers -- sorry, Bill). A number of technologies have been developed since that time to help.
How Does This All Work
When you mount(1m) a file system, what really happens? The mount command is a special command used for each file system type. The basic structure of the file system is written to the device(s), and when you type
mount, this basic structure is read from the raw device and processed. What is read is usually called the superblock for the file system. This is generally a special area for the file system that contains information on things like:
- What devices are used
- In what order the devices are used
- Characteristics of the file system such as allocation, performance, and strategies
- Allocation maps
- Location of the inodes and directory blocks
Some file systems are bootable, meaning the system can know about them at boot. This usually requires that the system BIOS or boot bios knows about the structure of the superblock and how to read it.
Occasionally, the file system must be checked for consistency before it can be mounted.
I believe log-based file systems were first discussed in a USENIX paper by Rosenblum and Osterhout entitled The Design and Implementation of a Log Structured File System, published in 1990. The authors analyzed I/O usage and concluded that:
- I/Os are short
- Most I/Os are random
- All data is fragmented on current file system technology (BSD and other file systems)
- Disk access speeds have not changed much
Since that time, file system vendors such as ADIC StorNext, Compaq/HP (ADvFS), Veritas (VxFS), SGI (XFS), IBM (JFS), Sun (UFS Logging) and a myriad of other vendors and Linux file systems (EXT3, ReiserFS) have taken the original concept and modified the concept to log only metadata operations. The goal is to ensure that the file system metadata is synchronized when the log area becomes full, and if the system crashes, the expectation is that only the metadata log will have to be checked for consistency after the boot rather than all of the metadata.
This file system check is commonly called fsck(1M). This logging methodology was developed based on the requirement to boot quickly after a crash. Almost all fsck(1M) versions that check just the log can check all of the metadata. This is sometimes important if you have had a hardware problem that went unrecognized and fear that the metadata data was corrupted.
Most file systems and volume managers allow the placement of the log on a device different than that of the data. This is done to reduce the contention between the log and increase the performance of the file system metadata operations. Each time the files are written or opened, the inode is kept in the log and is periodically written to the metadata area within the actual file system metadata area(s).
When performing large number of metadata operations, logging and the logging device performance can become an issue. With logging, the file system metadata is copied two times:
- The file system metadata is written to the log device
- The file system metadata is moved from the log device to the file system metadata area after the log becomes full
This double copy can become a performance bottleneck if:
- A large number of metadata operations fill the log, the file system is busy with data operations, and the log data cannot be moved quickly to the file system
- The log device is slower than the number of log operations that are required
Most people (including me) have both positive and negative philosophical issues with logs and logging, but typically you either fall into the logging camp or the non-logging camp. There are far more "loggers" than "non-loggers." Logging is currently the only method used for fast file system recovery, although other methods are possible.
It is important to remember that fast file system recovery is the requirement, and that logging is a method to meet that requirement, not the requirement itself. If someone comes up a with a file system that does not providing metadata logging, it will be a hard sell, as everyone thinks logging is a requirement at this point.
This is a commonly used term, but what are inodes really used for? Most inodes range in size from 128 bytes to 512 bytes, and vendors often use something called extended attributes, which are really just aggregations of more information than could fit in the inode. This happens when a vendor runs out of space in the inode and needs additional space for things like access control lists (ACLs), tape positioning for HSM, security, and other non-basic functions.
The basic function of an inode is to serve as an off-set from the superblock that identifies where the data resides on the devices under the control of the file system. The inode also provides information about ownership, permissions (read only for example), and access and create times. Inodes have the same basic functionality for Linux, UNIX and Windows systems.
The concept of inodes and how they are used have been around for over 35 years, and in that time not much has changed but the size. Most inode implementations allow 15 allocations per inode, and after that another inode is used for the additional allocated space. The 15 allocations is true even for the file systems that support large inodes (those greater than 256 bytes).
File Size Distribution
How and where file systems place data on the devices (notice I do not say disks because some file systems support hierarchical storage management (HSM), which basically means the data could reside on tape or disk or both) that are managed is an important -- and complex -- concept. Data allocation algorithms and data placement (which device(s) the file(s) will reside on) can be a big issue for some file systems. As mentioned in the last article, most file systems require volume managers when using multiple devices. Generally, there are two types of representation for free space:
- Btrees, as used by Veritas, StorNext, XFS, ReiserFS, and other file systems
- Bitmaps, as used by QFS, NFTS, and HFS (MAC/Linux)
Given what I have seen for most "user and scratch" file systems, a 90/10 rule applies -- approximately 90% of the files use 10% of the space, and 10% of the files use 90% of the space. Of course, sometimes the distribution is 95/5 or even a tri-modal distribution, but the point is that you are likely to have an extremely skewed distribution of sizes and counts for files, rather than a statistically normal distribution (Bell Curve).
Understanding how allocation is accomplished and the tunable parameters in a file system gives you a better understanding of how the file system will scale and perform in your environment given the file sizes, the file system allocation sizes, and how the data is accessed (random, sequential, large or small block).
Shared file systems both heterogeneous and homogenous are becoming commonplace, and the complexity of the architecture and its management is growing exponentially. These last two columns have provided a basis for understanding how the shared file systems and volume managers work internally, and understanding that will allow you to better evaluate file system tuning parameters and discern what they really mean, as all too often the documentation leaves a great deal to be desired.