This month's column is a tribute to a close personal friend, co-worker, and technical genius who died tragically in April of 2003. I am writing this to honor his memory and to show how one person can change, and has changed, the technology we all use today.
Laurence (Larry) P. Schermer began as a second shift customer engineer at Univac in 1977 after technical school and then completed a degree in math while working. In 1979, Larry moved to Cray Research, a highly innovative company that sought out equally innovative employees.
Larry's first job at Cray Research was to develop software for the unannounced Cray-1S I/O subsystem (IOS), so that it could read and write IBM 9 track tapes. Larry was part of a team that worked day and night for two straight years to develop software for:
- Booting the IOS
- Communicating with the Cray-1S system
- Disk drivers
- Most importantly, tape drivers and IBM tape emulation channels
This development was critical for Cray Research's strategy of selling to oil companies for seismic research and reservoir modeling, a market dominated by IBM because of tapes, not CPU performance. Without tape support on Cray systems, I believe that much of the oil found by these companies in the 1980s and early 1990s would not have been discovered, or the cost of finding it would have been significantly higher. Oil prices would likely be higher, which could have affected the entire economy. By the end of the 1980s, almost every major oil company in the world was using a Cray Research system.
UNIX and High Performance Computing
By the mid-1980s Cray had decided to move away from batch operating systems to a UNIX-based operating system. Many changes needed to be made to UNIX to support supercomputer functions. Larry helped develop functions such as:
- Memory scheduling of a real memory machine as compared to a virtual memory machine
- A caching mechanism to take advantage of the Cray's SSD hardware
- The listio(3RT) system call, which conceptually was developed for the older Cray COS operating system
Each of these areas required Larry to make major modifications to the original UNIX source from AT&T. Some of these Larry did by himself, while others were done with the help of co-workers.
A major achievement for that time was a new file system that required:
- High performance I/O
- Round-robin allocation
- Separation of data and metadata
- Different allocations for large and small files
- Inodes that were 4K and could store data to reduce seek overhead
Every high performance file system today has either all or almost all of these features, and the same is especially true for the shared file systems. These features reduced the number of disk head seeks, allocation time, and data fragmentation, and also significantly improved performance. All of these innovations were developed largely by a single person for the Cray nc1fs in the late 1980s and are still used today.
Storage Area Networking Ahead of Its Time
Larry left Cray Research in 1992 and took a few years off because he was burnt out, as many of us who work in this industry have been at one time or another. In 1994, he joined a company called NetStar (later purchased by Ascend, which was in turn purchased by Lucent) that had an idea for a new router. At the time, HiPPI (High Performance Parallel Interface) had become the predominant interface for high speed networking in the HPC community, given its peak rate of 100 MB/sec.
Larry worked with a hardware engineer to build a HiPPI to ATM converter. NetStar's goal was to connect local HiPPI networks to ATM WAN links. At the same time, Cray Research and other companies such as MaxStrat (a RAID company later purchased by Sun) were producing or going to produce HiPPI-based storage products. Cray had also developed a shared file system. Larry and I often discussed how it would be nice to access file system data without having to be local to the machine. The product was for the most part a market failure, primarily because HiPPI was not a commodity product and ATM WAN lines were very expensive at the time.
The key point, though, is that today we have commodity Fibre Channel directors from McData, Brocade, and Cisco that support WAN blades. We also have many products that support Fibre Channel to WAN connections from Nishan, LightSan, and others that support this concept. This originally was an idea that was developed in 1994 at NetStar.
In 1995, Larry again walked away from computing and took another year off. Early in 1996, Larry worked on a project to develop a HiPPI device driver for a new Fujitsu Supercomputer, the VPP700. Larry had never worked on the operating system, but as usual he dove in head first. Very early on he realized the performance of the HiPPI channel (~80 MB/sec data rate) would not be achievable from any user applications given the implementation of the MaxStrat Gen-5 RAID controller. Remember, this was 1996 -- very early in hardware RAID's life. The Gen-5 had a very small cache (24 MB) and allocated 64KB per device.
Given the backend structure of the Gen-5, the best performance from a LUN would come from a 5+1 RAID-5 internally striped with another 5+1; 8+1 RAID-5 was not supported. The stripe value would therefore be 640KB (10*64KB). As almost all file systems at the time allocated in powers of 2, and HPC applications very often read and wrote in powers of 2, Larry immediately saw a huge problem: power-of-2 requests would rarely line up with 640KB stripes. In his mind, the solution was simple -- add to the device driver a cache that would:
- Read/write on 640K boundaries to the RAID
- Readahead/writebehind in powers of 2 for the user requests and file system allocations
- Provide for syncing functions at shutdown or system crash
Adding this feature allowed the MaxStrat RAID to run at full rate for almost any sequential I/O access reading or writing. It eliminated almost all read-modify-write requests in the RAID and made better use of the very limited RAID cache, which was not a multiple of 640KB.
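To make the alignment problem concrete, here is a small sketch of the arithmetic only. It is not Larry's driver code; the function names are hypothetical, and it assumes that 640KB means 640 * 1024 bytes and that any write covering only part of a stripe forces a read-modify-write in the RAID.

```python
KIB = 1024
MIB = 1024 * KIB
# Assumed full stripe: two 5+1 RAID-5 sets, 10 data devices at 64 KiB each.
STRIPE = 10 * 64 * KIB


def covering_stripes(offset, length, stripe=STRIPE):
    """Return (start, end) of the stripe-aligned byte range covering a request."""
    start = (offset // stripe) * stripe
    end = -(-(offset + length) // stripe) * stripe  # round up to a stripe boundary
    return start, end


def partial_stripes(offset, length, stripe=STRIPE):
    """Count stripes that the request touches but does not cover completely."""
    first = offset // stripe
    last = (offset + length - 1) // stripe
    count = 0
    for s in range(first, last + 1):
        s_start, s_end = s * stripe, (s + 1) * stripe
        if offset > s_start or offset + length < s_end:
            count += 1
    return count


if __name__ == "__main__":
    # Five sequential 1 MiB writes issued directly to the RAID.
    direct = sum(partial_stripes(i * MIB, MIB) for i in range(5))
    # The same 5 MiB issued as one stripe-aligned transfer, as a cache could do.
    cached = partial_stripes(0, 8 * STRIPE)
    print(direct, cached)
```

Under these assumptions, five sequential 1 MiB writes touch eight partial stripes, while the same 5 MiB sent as eight whole 640KB stripes touches none. Eight stripes equal exactly 5 MiB, so the two sizes only line up every 5 MiB.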
New File System and New Ideas
Because of the success of the MaxStrat HiPPI driver for Fujitsu, in 1997 Larry and I were asked to develop a new file system for the Fujitsu VPP5000. A number of innovations came from this project, many of which have appeared in other file systems. Here are some of the features of this file system, which was delivered in March of 1998 -- just 7 months after we started:
- Separate data and metadata and separate caches for each file system with different allocation sizes for various RAID alignments. With this feature you could align data for large RAID allocations with RAID-5 and align metadata for small allocation on RAID-1
- Round-robin allocation for metadata. In cases where customers had a huge number of small files (especially common in some parts of the weather industry), the ability to have multiple metadata slices and different allocation sizes for metadata significantly improved performance for this type of environment
- Variable length allocations up to 128MB (did not require powers of 2). This was for RAIDs that did not support power of 2 stripes
- Metadata size could be up to 1MB. This was done so that we could, for example, mkfs the /usr file system with a metadata allocation equal to the size of the largest command, and then read a command in without having to read the inode and then read the command's data separately. This eliminated one disk read and its associated head seek and missed revolution, significantly improving interactive response.
- A new allocation method that used allocation buckets and best fit instead of bitmaps and first fit and/or btrees and first fit. This eliminated data fragmentation to almost immeasurable amounts.
- New fsck methodology. Given the overhead of logging file systems, where metadata must be written two times, Larry believed that the real requirement was a fast fsck after reboot, not logging. Instead of reading the file system metadata blocks one at a time, Larry wrote an fsck that read the entire metadata device(s) and then processed it in memory. During our acceptance in 1998 with FW-SCSI RAID, we had to meet a requirement of being able to fsck a file system holding 1 million files across 7 directory levels within 20 seconds after a crash. We beat that by eight seconds. I am unaware of any file system that uses this method even today.
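The difference between best fit and first fit in the allocation bullet above can be illustrated with a short sketch. This is not the VPP5000 file system's allocator, whose internals are not described here. The class and function names are hypothetical, and a single free list sorted by extent length stands in for the allocation buckets.

```python
import bisect


class BestFitAllocator:
    """Free extents kept sorted by (length, start) so the tightest fit is found by binary search."""

    def __init__(self, extents):
        # extents: iterable of (start, length) free ranges, in allocation units.
        self.free = sorted((length, start) for start, length in extents)

    def allocate(self, want):
        """Return the start of the smallest free extent holding `want` units, or None."""
        # Starts are non-negative, so (want, -1) sorts before any extent of length `want`.
        i = bisect.bisect_left(self.free, (want, -1))
        if i == len(self.free):
            return None
        length, start = self.free.pop(i)
        if length > want:
            # Put the unused tail back on the free list.
            bisect.insort(self.free, (length - want, start + want))
        return start


def first_fit(extents, want):
    """Baseline for comparison: the lowest-addressed extent that is large enough."""
    for start, length in sorted(extents):
        if length >= want:
            return start
    return None


if __name__ == "__main__":
    extents = [(0, 100), (200, 8), (300, 16)]
    alloc = BestFitAllocator(extents)
    print(first_fit(extents, 8), alloc.allocate(8))
```

With free extents of 100, 8, and 16 units, a request for 8 units under first fit is carved out of the 100-unit extent at address 0, while best fit takes the exact 8-unit extent at address 200 and leaves the large extent whole for a later large file.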
In 2000, Larry worked at SRC, a company developing a hybrid machine using Intel processors and ASICs. In 2003, Larry took his final contract at a startup called Scale8. I am sure both of those companies benefited from the innovation, work, and, most importantly, mentoring that Larry provided to those co-workers just starting out.
There are a number of unsung geniuses that have developed innovative technology we all use today. I have been lucky enough to know a few of them, and the industry just lost one of the best. I was fortunate to know him and even more fortunate to call him my friend.
Thanks for reading this. I hope you enjoyed it.