It was not all that long ago that InfiniBand (IB) was going to take over the world, replacing local and storage communications and solving various other technological and societal ills. This claim came from many organizations and startups during the late 1990s and the early part of this decade. I knew a number of people who went to work at IB startups during this time, only to find themselves out of a job not long afterwards. Was IB a casualty of the dotcom meltdown, or were other issues at work?
Today IB is making a comeback, although without the level of hype it generated several years ago. So why is IB back and making inroads in both storage and communications? Why did Cisco buy an IB company (TopSpin)? Clearly, it’s not 1999 anymore.
I’ll try to explain why I think IB is making a comeback, why it still might become the Fibre Channel killer it was supposed to be back in the late 1990s, and what this might mean for local communications.
IB is still not a mainstream technology, especially for storage, but it might become more general purpose as time wears on. Today it is used mainly for a number of high-performance computing applications.
What IB Missed the First Time
As many of you probably recall, most servers in the late 1990s and early 2000s used 64-bit, 66 MHz PCI, which had a theoretical peak performance of 532 MB/sec. Sun was still selling and supporting SBus, and some vendors were still using slower VME buses. SGI supported a faster bus, which was required for its HiPPI6400 (High Performance Parallel Interface), but that bus was proprietary. It should be noted that SGI tried to get others to adopt the HiPPI6400 standard to no avail, and I think I know why: it was the same reason IB failed.
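For reference, that theoretical figure is just the bus width times the clock rate. As a back-of-the-envelope check (the exact result depends on whether you treat the clock as 66 MHz or 66.66 MHz):

\[
\frac{64\ \text{bits}}{8\ \text{bits/byte}} \times 66\ \text{MHz} \approx 528\ \text{MB/sec}, \qquad
8\ \text{bytes} \times 66.66\ \text{MHz} \approx 533\ \text{MB/sec}
\]

Either way you land in the neighborhood of the 532 MB/sec peak quoted above, and sustained rates were lower still once bus arbitration and protocol overhead were accounted for.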
A number of IB vendors were popping up at the time, building IB HCAs (host channel adapters), while Compaq, HP, IBM and Sun dominated the server market. Only Sun had an open driver interface, but IB required more than just a driver interface, given the DMA part of the specification. The market was controlled by large server vendors whose software interfaces were not open, yet the IB companies needed those same vendors to support the products they were developing. The way I see it, the server vendors had no interest, for a number of reasons:
- Even at their theoretical rate of 532 MB/sec (and most buses did not come anywhere near it), the PCI buses could support only a fraction of the performance of IB.
- Server vendors wanted to sell what they had, and Fibre Channel was new and available. Storage vendors were not supporting IB, and without that, IB was just going to be another local high-speed interconnect.
- What’s in it for me? Every vendor needed to see a return on investment for what it was doing at the time, and IB represented a departure from the established Ethernet path.
- Some high-end vendors had their own proprietary interconnects. IBM had its own switch plans, and SGI had developed a proprietary bus for HiPPI6400 because PCI was not fast enough; that bus was unavailable to IB vendors.
For these reasons, IB failed when it was initially developed, but changes were on the horizon. It was also about this time that Linux clusters were just starting to become an important part of the HPC market. These clusters needed an interconnect, and technologies such as Myrinet, Quadrics and Dolphin provided high-speed DMA connectivity with low latency. These vendors worked to modify MPI (message passing interface), which was used for parallel applications, to take advantage of their hardware. Their NICs plugged into standard PCI slots, allowing easy integration into Linux environments with open drivers, and not surprisingly, some server vendors started to look at these technologies for clustering their SMP systems. The NICs were limited in bandwidth and latency, not by their design but by the limitations of the PCI slots, since most slots did not run anywhere near their rated performance.
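To make the MPI connection concrete, here is a minimal sketch of the kind of ping-pong micro-benchmark used to measure interconnect latency between two ranks; the message size and iteration count are arbitrary choices for illustration. It is generic MPI, not vendor code: the same source runs over Myrinet, Quadrics, IB or plain TCP/IP, and only the numbers it prints change.

```c
/* pingpong.c - minimal MPI round-trip latency sketch (illustrative only).
 * Build (typical): mpicc pingpong.c -o pingpong
 * Run   (typical): mpirun -np 2 ./pingpong
 */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000   /* arbitrary iteration count for illustration */
#define BYTES 8      /* small message, to expose latency rather than bandwidth */

int main(int argc, char **argv)
{
    int rank;
    char buf[BYTES] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {
            /* rank 0 sends a small message and waits for the echo */
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            /* rank 1 simply echoes the message back */
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg one-way latency: %.2f usec\n",
               (t1 - t0) / ITERS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```

That portability is largely why the interconnect vendors put their effort into tuned MPI layers rather than new programming interfaces: the application code stays the same, and the transport underneath competes on the latency and bandwidth numbers.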
Fast forward a few years and these vendors had moved their technology to PCI-X and later PCI Express. So slowly but surely, PCI bus technology evolved to support IB rates.
Change Arrives
In 2004, I noticed that Linux cluster vendors were beginning to talk about using IB the following year. I asked myself what had changed, and came up with the following:
- PCI-X bus performance was close to that of IB, and many PCI-X buses actually ran at their rated speed of just over 1 GB/sec (a rough calculation follows this list).
- PCI Express was on the short-term horizon for some vendors, and the performance bump more than met IB’s requirements.
- Customers wanted a standard for DMA between machines, and IB was an open standard with a standard version of MPI for parallel applications.
- Some storage vendors were showing interest in DMA communications given the reduction in latency, but none of them wanted to deal with three different companies; IB gave them a single standard.
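As a rough sanity check on those bus numbers (taking 64-bit, 133 MHz PCI-X and 4X single-data-rate IB, and remembering that IB’s 10 Gb/sec signaling rate carries about 8 Gb/sec of data because of 8b/10b encoding):

\[
\text{PCI-X: } 8\ \text{bytes} \times 133\ \text{MHz} \approx 1{,}064\ \text{MB/sec}
\]
\[
\text{IB 4X SDR: } 10\ \text{Gb/sec} \times \tfrac{8}{10} = 8\ \text{Gb/sec} \approx 1\ \text{GB/sec}
\]

Roughly speaking, for the first time the slot the HCA plugged into was no longer the obvious bottleneck.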
Also at this time, server vendors started showing an interest in IB, so there was a convergence of factors that led to where we are today: Linux clusters, improved buses, vendors with DMA products that pre-dated IB, and multiple vendors developing IB NICs and switches. Without all that, IB would not have become reality.
The Future
As with many standards, IB has a short window in which to be adopted or it runs the risk of becoming a non-issue. IB must become a commodity. With the RDMA TCP/IP standard nearly here in products, and with 10 Gb Ethernet and other connectivity plans coming from the storage vendors, the IB standard must be adopted quickly or it will become a thing of the past. IB has a number of things in its favor right now:
- Storage connectivity with DMA — a few vendors such as DataDirect and LSI Logic have developed IB storage interfaces. This is very important in the HPC Linux cluster world, since these customers desire a single communications fabric that supports parallel applications and storage. IB provides that today.
- Fibre Channel performance is lagging. Look at the time it took to go from 1 Gb to 2 Gb, and now from 2 Gb to 4 Gb. IB is here today at 10 Gb, which allows higher-performance connectivity to current RAID controllers (see the rough comparison after this list). Of course, I don’t expect disk drive vendors to support IB anytime soon, so Fibre Channel is not going to die.
- No high-end storage vendors currently support iSCSI or SCSI over 10 Gb Ethernet. Though some have talked about this, products have yet to appear.
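For a rough sense of scale, applying the same 8b/10b encoding overhead to both links (nominal per-direction data rates, ignoring protocol overhead):

\[
\text{4 Gb FC: } \approx 4.25\ \text{Gbaud} \times \tfrac{8}{10} \approx 3.4\ \text{Gb/sec} \approx 425\ \text{MB/sec}
\]
\[
\text{IB 4X SDR: } 10\ \text{Gbaud} \times \tfrac{8}{10} = 8\ \text{Gb/sec} \approx 1{,}000\ \text{MB/sec}
\]

That difference is what makes a single IB link attractive in front of a current RAID controller.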
Let’s face it: if IB didn’t have potential, Cisco wouldn’t be spending money on it. That’s plenty of evidence right there.
Still, IB faces some challenges in its bid for market acceptance as a solution for low latency communications and high bandwidth between systems and storage. Those challenges are:
- How quickly will RDMA TCP/IP over 10 Gb Ethernet become available for both storage and HPC communications, given that these customers are currently driving the market requirements?
- What are the real bandwidth and latency of RDMA TCP/IP for MPI and storage applications?
- Can IB move beyond MPI for communications (IB supports TCP/IP, but latency often increases and bandwidth drops) to support applications such as parallel databases with a significant performance improvement?
Would I go out and buy an IB communications environment today? My answer would depend on where it would be used and what the requirements are. That is a different answer than the one I gave a customer six months ago, when I said no way. IB is not for everyone yet, but it could get there, and given the big names behind it, it could get there faster than we think. Don’t put IB too far in the back of your mind, since it might just turn out to have staying power.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years experience in high-performance computing and storage.