Over the last few months we have reviewed the data path from the application through the operating system. The next step in the data path is the file system, and the volume manager that accompanies most file systems.
The file systems as we know them today trace their roots to a proposal for the Multics operating system in 1965. (See http://www.multicians.org/fjcc4.html for more details.) This proposal for all intents and purposes became the Unix File System file (UFS) in the early 1970s and remains the UFS as we know it today. Not much has changed in file systems over the last 35 years -- they recover faster and support larger files and file systems, of course, but these are evolutionary changes rather than radical changes. Compare the changes in file systems to the hardware changes that have been made over the same period of time, and you'll see only incremental change overall at best.
In my opinion, the file system (and the associated volume manager) is the most critical component in the data path due to its ability to dramatically affect I/O performance. Even the best file system and volume manager available today can be improperly configured to the point that performance is horrible.
File System (FS) Basics
The purpose of file systems is to maintain a consistent view of storage so that we can effectively manage it. This is done in a way that allows the users to create files and directories as well as delete, open, close, read, write and/or extend the files on the device(s). File systems also maintain security over the files that they maintain and, in most cases, access control lists for a file.
Volume Manager (VM) Basics
Volume management was developed in the late 1980s to enable the creation and management of file systems larger than a single disk, which offered the possibility for greatly improved performance. Since almost all file systems at that time could only mkfs (create a file system) a single device, volume mangers provided file systems a single set address range for multiple devices. Other volume manager enhancements in the 1990s included support for software RAID 1 and 5.
Standard Volume Manager (VM) Inner Workings (Striping)
Most file systems require a VM to group disk and/or RAID devices together, typically via striping. Striping spreads the data across the devices based on the stripe size set within the volume manager. The idea behind striping is to spread the data across multiple devices to improve performance and to allow multiple I/O disk heads to be seeking simultaneously. It should be noted that some volume managers support something called concatenation, which starts with an initial device and then begins writing to a second device only after the first has become full.
The following examples show what happens under standard striping when writing multiple files at the same time (first illustration) and what happens when one of those files is removed (second illustration).
File Systems that Maintain Their Topology
Some modern file systems maintain and understand the device topology without a volume manager. These file systems support both striping and something called round-robin allocation. Round-robin allocation means that each device is used individually. In most cases, each file open moves to the next device. In some file systems, it could be that each directory created moves to the next device. Here is an illustration of round-robin allocation:
As we will see, round-robin allocation has some other important implications for performance as well.
File Allocation Comparison
One of the main reasons that many volume managers do not provide a round-robin allocation method is due to the interaction between the volume manager and the file system. Every file system must allocate space and maintain consistency, as this is one of the main purposes of the file system. There are multiple types of file system allocation, but the real issue is that a volume manager presents a single set address range for the block devices in the file system for the file system to allocate from, and the volume manager translates the address to each of the devices. It is difficult but not impossible for the volume manager to pass all of the underlying device topology to a file system. Also, just as important, most file systems designed with volume managers do not have an interface to understand the underlying volume topology.
How Volume Managers and File Systems Work
To choose the best file system for the application environment, it is important to fully understand how volume managers and file systems work internally. By understanding the inner workings of the software, you will have a much better idea what the tunable parameters represent and how to improve performance with those tunables for your available hardware and the application environment.
Each file system supports a method of how space is represented for files within the file system. The two most common methods are:
- Indirect blocks
If the file system is using extents-based allocation, space is allocated within the inode for block address locations for the file data. For most file systems, 15 extent addresses are used for the data in the base inode, and the last address in the inode is linked to another inode where an additional 15 extent addresses are available.
With indirect block allocation, the last block of space allocated for a file points to the next allocation. Space is allocated for the data in the base inode, and the last block of the allocated data space points to the next allocated space. Indirect blocks were originally designed for the UFS file system in 1970, when disks drives were slow and seeks were not required to go back to the inode and allocate additional space.
Here is an example of how Sun's UFS file system does allocation:
http://docs.sun.com/db/doc/801-6631/6i10bkb1a?a=view offers a helpful write-up on UFS internals.
Indirect block allocation and read/write performance is generally slower in comparison to the extents-based allocation method. I am unaware of any modern file systems using indirect blocks for space allocation because of the large performance penalties for random I/O. Even for sequential I/O, the performance for indirect blocks is generally still less than that of extents-based file systems.
Free Space Allocation and Representation Methods
Each file system uses an algorithm to find and allocate free space within the file system. Most file systems use binary trees (btrees) allocation to represent free space, but some file systems use bitmaps to represent free space. Each method of free space representation has its advantages and disadvantages.
The use of bitmap representation is less common. In this method, each bit in the map represents a single allocation unit such as 1,024 bytes, 512 KB or even 100s of megabytes; therefore, a single bit could represent a great deal of space.
Some good information can be found at:
Binary Trees (Btrees) Representation
A btree is basically a sorted list of all the free allocations and used space for the file system. It is important to understand how the space is allocated from the tree. Some good reading on btree allocation, which is used in most file systems to find free space, can be found at:
Free space Allocation
With each allocation type (btree or bitmap), free space must be found and allocated with the representation method. These allocation methods find the free space based on their internal search algorithms. The two most common methods used are first fit and best fit.
The first fit method tries to find the first space within the file system that matches the allocation size requested by the file being allocated. In some file systems, the first fit method is used to find the space closest to the last allocation of the file being extended, thereby allowing the allocation to be sequential block addresses for the file within the file system. Here is a good example of first fit:
The best fit method tries to find the best place in the file system for the allocation of the data. This method is used to try to reduce total file system fragmentation. This method always takes more CPU cycles than first fit, as the whole file system must be searched for the best allocation. (Note: In systems that use round-robin allocation, only the device that the initial allocation was made on needs to be searched.)
This method also works better at reducing fragmentation, especially on large (multiple megabyte) allocations or when files cannot be pre-allocated (for file systems that support this method). Most vendors do not support this method, and most allocations in file systems are not large, as the overhead would be huge. The old Cray NC1FS supports this method by using hardware vector registers to quickly do the search.
Here is a good example of best fit:
In general, btrees do a better job for small allocations, but then the files could become fragmented. Bitmaps are better for larger allocations, but the time for allocation can be much longer to search the map for free space. Both of these methods can be optimized for your operational requirements by tunable parameters in the file system and volume manager as well as by RAID configuration and allocations, which we will review in a few months. While some file systems have a hybrid approach to allocation, these are generally proprietary and require a non-disclosure agreement to completely understand their inner workings.
You should now have a good understanding of how file systems and volume managers work internally. Next month we will apply this information to specific tuning issues and to understanding file system and volume manager tradeoffs and choices. As we follow the data path, we'll then move on the RAID devices and discuss how to put all of this information to good use.