The State of File Systems Technology, Problem Statement - Page 2
Putting Our Money Where Our Mouths Are
During that great meal we had, Jeff and I said it would be great if someone could really test ext4 and xfs with 50 TB or even 100 TB of storage and put 50 million to 100 million files (or even the proverbial 1 billion files) in the file system with a large number of files per directory -- something we both see in the real world. We thought this was a great idea and had never seen anything published for big Linux file systems nor known anyone to do this. (Note: We don't necessarily consider 50 TB or 100 TB a large file system any more, but it's a starting point.) Jeff called around and was able to make that happen. Once we knew it was possible, we talked about a test plan, schedule and so on.
We both agreed that the problem with large file systems is the metadata scan rate. Let's say you have 100 million files in your file system and the scan rate of the file system is 5,000 inodes per second. If you had a crash, the fsck could take 20,000 seconds, or about 5.5 hours. If you are a business, you would lose most of the day waiting on fsck to complete. THIS IS NOT ACCEPTABLE. Today, checking a 100-million-file file system should not take that much time, given the speed of networks and the processing power in systems. Add to this the fact that a single file server could support 100 users; 1 million files per user is a lot, but not a crazy number. The other issue is that we do not know what the scan rate is for large file systems with large file counts. What if the number is not 5,000 but 2,000? Yikes, for that business. With enterprise 3.5-inch disk drives capable of between 75 and 150 IOPS per drive, 20 drives should be able to achieve at least 1,500 IOPS. The question is what percentage of hardware bandwidth can be achieved with fsck for the two file systems?
This is what we are going to investigate.
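To make the arithmetic above concrete, here is a minimal Python sketch of the estimate. It assumes fsck time is dominated by a simple linear inode scan and that drive IOPS scale perfectly with drive count; both are simplifications for illustration, and the function names are hypothetical, not part of any fsck tool.

```python
def fsck_seconds(file_count: int, inodes_per_second: float) -> float:
    """Estimate fsck time, assuming a purely linear inode scan (a simplification)."""
    if inodes_per_second <= 0:
        raise ValueError("scan rate must be positive")
    return file_count / inodes_per_second


def aggregate_iops(drive_count: int, iops_per_drive: int) -> int:
    """Naive aggregate IOPS, assuming perfect scaling across drives."""
    return drive_count * iops_per_drive


if __name__ == "__main__":
    files = 100_000_000  # the 100-million-file example from the text
    for rate in (5_000, 2_000):  # the two scan rates discussed in the text
        secs = fsck_seconds(files, rate)
        print(f"{rate:>5} inodes/s -> {secs:>8,.0f} s ({secs / 3600:.1f} h)")
    # 20 enterprise drives at 75 to 150 IOPS each
    print("20 drives:", aggregate_iops(20, 75), "to", aggregate_iops(20, 150), "IOPS")
```

Under these assumptions, dropping the scan rate from 5,000 to 2,000 inodes per second stretches the estimate from 20,000 seconds to 50,000 seconds, or close to 14 hours.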
One last comment: We may sound pessimistic, but we know Red Hat developers, like Dave Chinner and Eric Sandeen, have been working very hard on improving the metadata performance of xfs. One of the goals of these tests is to see if their effort has resulted in fsck performance that is worthy of enterprise production systems.
We came up with a plan, and with our editor's agreement, Jeff and I are embarking on a four-part series. Each of us will review the other's articles playing to our respective strengths and checking each other's work to make sure we are being fair to the file system and the testing is realistic. We hope to get this work done during the next month and a half, so check back often. Here is the plan:
- Article 1: Problem Statement article, which you just read (Henry)
- Article 2: Test Plan and test plan justification (Jeff)
- Article 3: Reporting on the testing (Jeff)
- Article 4: Analysis of the testing results (Henry)
Bear in mind, however, the following constraints Jeff and I have, besides the biggest one, our full-time jobs:
- Jeff does not have unlimited time on the hardware
- Jeff does not have unlimited hardware and servers
- This testing is not about the performance of the hardware, but the performance of the file system; we will attempt to normalize against that
Feel free to write us and let us know what you think, but nothing threatening this time, please (our life insurance rates keep going up).
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.