As file systems have continued to grow in size, fragmentation has become an increasingly important issue. It wasn’t all that long ago that we were living with measly 2 GB file systems and even smaller files. And while vendors were next able to provide file systems up to 2 TB, file sizes were often still limited to 2 GB.
Now as we approach the middle of the decade, large file systems are in many cases 50-100 TB, and some environments are looking to have 500 TB to multiple-petabyte file systems within a few years. Of course, several things are going to have to change to allow this to happen. One of the developments I believe will have to occur is the replacement of the SCSI standard with Object Storage Devices (OSD) (see www.t10.org/ftp/t10/document.03/03-275r0.pdf for more information), but we’re still a number of years away from seeing end-to-end OSD products in the mainstream.
My conjecture regarding fragmentation and the ever-increasing size of file systems is:
File system fragmentation grows with file system size, and given that the sizes of file systems are increasing, end users are experiencing performance problems even if they do not always realize it.
Working off this premise brings up a number of issues:
- What is file system fragmentation?
- Will this happen to my file system?
- Can it be dealt with, and if so, how?
What is File System Fragmentation?
In the simplest terms, I would define file system fragmentation as any access to any data that is expected to be sequential on the device but is not. This is a broad definition, and covers a number of areas of data access within a file system.
In the Fibre Channel/SCSI world we currently live in, disk or RAID LUNs are simple block devices. This means that they address the first block of the LUN as block 0 and the last block as a number less than 4,294,967,296, which is 2 TB in 512 byte blocks. Note that 2 TB is the current limit on the largest SCSI LUN that can be created, and 512 bytes is the size of the formatted disk sector.
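The 2 TB limit falls directly out of that 32-bit block addressing; a quick arithmetic check (sketched here in Python):

```python
# 32-bit SCSI block addressing: blocks are numbered 0 .. 2**32 - 1
SECTOR_SIZE = 512                # bytes per formatted disk sector
max_blocks = 2 ** 32             # 4,294,967,296 addressable blocks

max_lun_bytes = max_blocks * SECTOR_SIZE
print(max_lun_bytes)             # 2199023255552 bytes
print(max_lun_bytes / 2 ** 40)   # 2.0 TB
```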
Almost all file systems allocate data based on a first-fit algorithm within either the LUN or groups of LUNs used for round-robin allocation (see this article for more information). First fit means exactly that: find the first hole in the file system and place the data in that location.
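First fit can be sketched as a linear scan over a free-space list. This is an illustrative toy only; the names and data structure are my own and do not correspond to any particular file system’s code:

```python
def first_fit(free_extents, request):
    """Return the offset of the first free extent that can hold
    `request` blocks, updating the free list in place.
    free_extents: list of (offset, length) pairs, sorted by offset."""
    for i, (offset, length) in enumerate(free_extents):
        if length >= request:
            if length == request:
                del free_extents[i]          # exact fit: the hole disappears
            else:
                # shrink the hole from the front; the data lands at `offset`
                free_extents[i] = (offset + request, length - request)
            return offset
    return None                              # no hole is large enough

# Example: holes left behind by earlier file deletions
holes = [(0, 2), (10, 8), (30, 100)]
print(first_fit(holes, 4))   # 10 -- the 2-block hole at offset 0 is skipped
print(holes)                 # [(0, 2), (14, 4), (30, 100)]
```

Note how the request leaves a smaller hole behind; repeated over millions of allocations and deletions, this is exactly how the free list degenerates into scattered fragments.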
Remember, file systems communicate with the storage device, and within that storage all allocations fall on 512 byte boundaries. A file system can only access data that is smaller than 512 bytes, or that is not aligned on a 512 byte boundary, by reading it in on a 512 byte boundary.
File system metadata is another issue. There are often three parts of the metadata equation that need to be addressed:
- The file system inodes
- The allocation blocks that are used when the space for allocation within the inode is exceeded
- The file system internals, often called the superblock, that can be fragmented in some file systems if the file system grows in size
Metadata is an important and often overlooked area for fragmentation, and this is especially true for file systems that support HSM (Hierarchical Storage Management), as they often have tens of millions of files and sometimes more than 100 million.
Will This Happen to My File System?
Will fragmentation occur on your file system? While the answer to this question is a definite yes, it isn’t the critical question you should be asking. The real question that needs to be asked is not will this happen, but rather will you see the effect. If your applications only require a limited amount of the I/O bandwidth available, then when your file system does get fragmented, you’ll often still have the available bandwidth to meet your I/O requirements. This is what is generally referred to as architecting or engineering to peak.
In this case, if your peak performance is based on a completely fragmented file system, you will not have a problem. What often happens is people buy far more I/O bandwidth than is really needed for the applications they run. One of the underlying reasons for this in my opinion is fragmentation.
If you allocate data sequentially on a modern 15K RPM disk drive and you read it back, you can easily achieve over 50 MB/sec on average reading the entire device, but what if the data was not read sequentially? For example, consider an application that reads 8 KB blocks completely randomly, and the device has an average seek time plus average rotational latency of .007 seconds:
1 / (8192 / (1024 * 1024 * 50) + 0.007)

where 8192 is the block size in bytes, 1024 * 1024 * 50 is the 50 MB/sec transfer rate, and 0.007 seconds is the average seek plus rotational latency.
In this case, you could only read about 140 8 KB blocks per second, or about 1.09 MB/sec (roughly 1/45th of what you would read on average with sequential I/O). Of course, this is the reason we have multiple 200 MB/sec channels and large numbers of disk drives in our systems. Almost no application can actually read or write data at the full rate of multiple 200 MB/sec channels, but those channels are needed to support the many small random I/Os that most applications generate.
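The arithmetic above can be reproduced directly; a sketch, assuming the 50 MB/sec transfer rate and 7 ms combined seek-plus-latency from the example:

```python
BLOCK = 8192                     # bytes per random read
XFER = 1024 * 1024 * 50          # 50 MB/sec sequential transfer rate
SEEK_LAT = 0.007                 # average seek + rotational latency (seconds)

time_per_read = BLOCK / XFER + SEEK_LAT     # transfer time + positioning time
reads_per_sec = 1 / time_per_read
throughput_mb = reads_per_sec * BLOCK / (1024 * 1024)

print(round(reads_per_sec))      # ~140 reads/sec
print(round(throughput_mb, 2))   # ~1.09 MB/sec, roughly 1/45th of 50 MB/sec
```

The positioning time (7 ms) dominates the transfer time (about 0.16 ms), which is why random access is so much slower than sequential.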
As the data sections of file systems grow in size, and files are created and removed more often, searching for free space becomes more time-consuming. For small files this is typically not a problem, as small files often fit in a single file system allocation block. Large files, on the other hand, can be a big problem.
The definition of a large file depends on the underlying allocation size within the file system. If you have a file system where the smallest allocation is 25 MB and your average files are 100 MB, then I consider this a small file rather than a large one, as you only need to go back to the allocation routines at most 4 times.
On the other hand, if your smallest allocations are 4 KB, then I would consider the same 100 MB file a large file given the number of times you’re likely to have to go back to the allocation routines. Each time you go back to the allocation routines to search for free space, you are competing with other processes for space allocation.
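The difference in allocation pressure between the two allocation sizes is easy to quantify; a sketch, assuming the 100 MB file from the example:

```python
import math

FILE_SIZE = 100 * 1024 * 1024        # a 100 MB file

for alloc in (25 * 1024 * 1024,      # 25 MB smallest allocation
              4 * 1024):             # 4 KB smallest allocation
    trips = math.ceil(FILE_SIZE / alloc)   # visits to the allocation routines
    print(alloc, trips)
# 25 MB allocations ->      4 trips
#  4 KB allocations -> 25,600 trips
```

Each of those 25,600 trips is an opportunity to lose a sequential extent to a competing process.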
Remember, space allocation is most often first fit, so the space you get depends on what files have already been allocated and what other processes are asking for space. This all goes back to my original conjecture that most applications either do not have a need for full bandwidth from the file system and/or most architectures are developed with enough disk drives and channel IOPS (Input/Output operations per second) to support the needs of most applications.
The fragmentation of metadata has only recently become a major issue. As file systems have grown, so has the file system metadata space required as well as the resulting performance issues. One of the first major architectural attempts to improve metadata performance was done by someone at Cray Research back in the late 1980s, and that was the separation of data devices and metadata devices (see this article for more information).
Since that time, most file systems that are used in large environments separate, or at least have the option to separate, data and metadata, and most of the same file systems that log operations also have the ability to separate the log to a separate device.
For most file systems that I have looked at, metadata allocation is sequential, just like data allocation. If you create a file and then another file, you will use sequential inodes. If you then remove a file, that inode is available for a first fit allocation. In removing and adding files from many directories and replacing them, and then doing this over and over again, you have effectively produced random allocations for the metadata.
Running a command such as ls -l would then require reading the metadata for each file in alphabetical order, and each access would likely involve a small read (often 512 bytes), an average seek and average rotational latency, and then another read. Basically, we could do about 142 of those reads per second, which is not much different from what we could do with the 8 KB I/Os above. The reason, of course, is that the seek and latency time dwarfs the transfer time for the data, so performance in this case would be quite bad.
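The 142-reads-per-second figure follows from the same formula as before, just with a 512-byte transfer in place of 8 KB; a sketch with the same assumed drive parameters:

```python
INODE_READ = 512                 # bytes per metadata read
XFER = 1024 * 1024 * 50          # 50 MB/sec transfer rate
SEEK_LAT = 0.007                 # average seek + rotational latency (seconds)

reads_per_sec = 1 / (INODE_READ / XFER + SEEK_LAT)
print(int(reads_per_sec))        # ~142 inode reads/sec
```

Shrinking the transfer from 8 KB to 512 bytes barely helps, because the 7 ms of positioning time swamps the transfer time either way.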
Of course, this is all purely theoretical, as numerous other factors play a part, such as RAID cache, RAID readahead, inode readahead, inode allocation sizes, inode cache, and others. On the other hand, there are also factors that hurt performance, like latency to the terminal and contention. I know of one site where the ls -l in this example took almost 400 seconds, before the client dumped the metadata and restored it. This process rewrote the inodes for each file and each directory in order, after which the ls -l command took only 25 seconds.
Some file systems, such as NTFS, have defragmenters that defragment the data space and sometimes the metadata space (notice I say the metadata space, not the metadata itself; NTFS allocation and fragmentation is a topic unto itself). The performance degradation caused by fragmentation does not generally happen immediately; it is slow and happens over time, making it even more difficult to diagnose. Usually, at some point the file system reaches a steady state of poor performance because it is as fragmented as it can get.
Some file systems — often those supporting hierarchical storage management (HSM) — support methods of dumping and restoring the file system metadata. When it is restored, it generally is restored in a more optimal way, but this can be very time consuming. Metadata fragmentation is a new area that has just recently come to the forefront. It is going to take some time before file system vendors are able to effectively address this new area.
File system growth has been dramatic over the last few years. I remember a time not that long ago (1996) when 5 TB file systems were a big deal, and now 70 TB file systems are in use. This exceeds Moore’s Law by a long shot. The key points in this case are recognizing the problems for both data and metadata fragmentation and working with the tools and the vendors to get the problems resolved.