If you are a regular reader you know Jeff Layton and I have similar approaches to technology issues and problem solving. We also both like file systems and work on storage. We recently ran into each other at a file system user group meeting and were discussing Linux file systems and file systems in general over a great Cuban dinner. Jeff and I came to the conclusions that, as it is often said, "Houston we have a file system scaling problem."
A few years ago, I wrote an article in which I made some disparaging remarks about the scalability of Linux file systems. I received more than 30 of emails telling me I did not know what I was talking about, some of them explicit and nearly threatening. Back in April, while I was doing a bit of research for a customer on the maximum size support for a number of file systems, I found the follwing information on Red Hat's web site.
File Systems and Storage Limits
|Maximum filesize (Ext3)||2TB||2TB||2TB||2TB|
|Maximum filesystem size (Ext3)||2TB||8TB||16TB||16TB|
|Maximum filesize (Ext4)||--||--||16TB||16TB|
|Maximum filesystem size (Ext4)||--||--||16TB||16TB|
|Maximum filesize (GFS)||2TB||16TB/8EB||16TB/8EB||N/A|
|Maximum filesystem size (GFS)||2TB||16TB/8EB||16TB/8EB||N/A|
|Maximum filesize (GFS2)||--||--||25TB||100TB|
|Maximum filesystem size (GFS2)||--||--||25TB||100TB|
|Maximum filesize (XFS)||--||--||100TB||100TB|
|Maximum filesystem size (XFS)||--||--||100TB||100TB|
Clearly, the file system community as group has not taken my concerns to heart. There has been progress in some areas, but the goal of a 500 TB single name space Linux file system still seems years away.
What Is the Problem?
When Jeff and I talked over dinner about the merits and demerits of the file systems listed above, the biggest issue we had was that the size limits of the file systems are unbelievably small given the storage sizes we have today.
With 3TB, drives ext 3/4 maxes out at five disk drives. Jeff and I thought that was just insane, given you can buy five 3TB drives at Fry's and put them in your desktop. XFS maxes out at 33 3TB disk drives, and even that is far too small in our opinion. Clearly, supported file system sizes have not scaled with disk drives sizes or the demand for big data. I have a home NAS device with six drives, and it is a good thing I did not get an eight-drive NAS; XFS is not supported by my NAS vendor, and I could be over the ext3 limit and thus have to create multiple file systems.
Laying the Groundwork
We already know there will be a number of naysayers, so let's answer some of your comments upfront (the proverbial "pre-emptive strike" as it were). A first possible comment is, "why do guys need any stinking support, just download and go debug yourself? You know how." That might be true, but the fact of the matter is, it is not about us -- it is about the market reality in business today (and remember, we are both proponents of Linux, and Jeff writes a weekly article series about Linux and storage).
A second possible response is, "you guys are stupid and should not want file systems that big." Our response to this is that the reason you tell us not want these big file system is because they do not scale. We want them and our customers want them. When an eight-drive NAS must be broken into multiple file systems, I think we have a broken file system development and support model.
Jeff and I did some checking and we believe, ultimately, there are two problems.
- In the current file system model, the listed file system has metadata scaling problems for large counts of files with these large sizes. Although XFS is supported to 100TB, what we were told and have seen is that performance degrades with large file counts, especially when the metadata gets fragmented
- As these file systems grow near their limits, performance for streaming does not seem to scale linearly with size and degrades with fragmentation
Jeff and I decided to look at the first problem because without scalable metadata, sustained large block performance does not really matter much in our opinion. I have always been a big proponent of fsck performance. Now some vendors have stated in the past that all you must do is check the logs, and you never need to verify the file system.
That is total and complete garbage. In all the times Jeff or I have seen hardware have a problem, neither of us have ever seen an operational file system be able to recover from a RAID or storage hardware problems 100 percent of the time. It is the nature of POSIX file systems, and the fact of the matter is that only metadata is logged -- not data given the performance implications. Jeff and I speculated during dinner that one of the main reasons Linux puts size limitation on file systems is the amount of time it takes to run an fsck after a hardware incident. It is necessary to run fsck against the metadata (e.g., superblock, inodes, extents and directories) after a crash, and it is critically important after a hardware incident on the storage.
Putting Our Money Where Our Mouths Are
During that great meal we had, Jeff and I said it would be great if someone could really test ext4 and xfs with 50 TB or even 100 TB of storage and put 50 million to 100 million files (or even the proverbial 1 billion files) in the file system with a large number of files per directory -- something we both see in the real world. We thought this was a great idea and had never seen anything published for big Linux file systems nor known anyone to do this. (Note: We don't necessarily consider 50TB and 100TBs a large file system any more, but it's a starting point.) Jeff called around and was able to make that happen. Once we knew it it was possible, we talked about a test plan, schedule and so on.
We both agreed that the problem with large file systems is the metadata scan rate. Let's say you have 100 million files in your file system and the scan rate of the file system is 5,000 inodes per second. If you had a crash, the time to fsck could take 20,000 seconds or about 5.5 hours. If you are a business, you would lose most of the day waiting on fsck to complete. THIS IS NOT ACCEPTABLE. Today, a 100-million file file system should not take that much time, given the speed of networks and the processing power in systems. Add to this the fact that a single file server could support 100 users and 1 million files per user is a lot, but not a crazy number. The other issue is we do not know what the scan rate is for the large file systems with large file counts. What if the number is not 5,000 but 2,000? Yikes, for that business. With enterprise 3.5 inch disk drives capable of between 75 and 150 IOPS per drive, 20 drives should be able to achieve at least 1,500 IOPS. The question is what percentage of hardware bandwidth can be archived with fsck for the two file systems?
This is what we are going to investigate.
One last comment: We may sound pessimistic, but we know Red Hat developers, like Dave Chinner and Eric Sandeen, have been working very hard on improving the metadata performance of xfs. One of the goals of these tests is to see if their effort has resulted in fsck performance that is worthy of enterprise production systems.
We came up with a plan, and with our editor's agreement, Jeff and I are embarking on a four-part series. Each of us will review the other's articles playing to our respective strengths and checking each other's work to make sure we are being fair to the file system and the testing is realistic. We hope to get this work done during the next month and a half, so check back often. Here is the plan:
- Article 1: Problem Statement article, which you just read (Henry)
- Article 2: Test Plan and test plan justification (Jeff)
- Article 3: Reporting on the testing (Jeff)
- Article 4: Analysis of the testing results (Henry)
Bear in mind, however, the following constraints Jeff and I have besides the biggest one, our full-time jobs.
- Jeff does not have unlimited time on the hardware
- Jeff does not have unlimited hardware and severs
- This testing is not about the performance of the hardware, but the performance of the file system; we will attempt to normalize against that
Feel free to write us and let us know what you think, but nothing threating this time, please (our life insurance rates keep going up).
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.