RDMA TCP/IP: Coming Soon to a Network Near You
As PCI Express bus performance increases and the bus becomes commonplace, network performance will also increase. That will likely shift performance problems to another part of the system, the TCP/IP stack, and the solution there will likely be RDMA.
I think the reason InfiniBand (IB) never took off when it was developed in the late 1990s is that the PCI bus at the time could not take advantage of its performance. There was no way to justify the cost of IB when PCI bus performance was so slow compared to IB performance.
That picture is changing, thanks to the proliferation of PCI Express on the low end, and I suspect that future large servers on the high end will adopt the same technology. Also, vendors appear to be taking more care to ensure that their PCI buses run at rate. All of this means that faster network interfaces are now common. As with many problems in our industry, the performance problem will shift to another part of the system. Since the PCI bus has been fixed for most systems by the introduction of PCI Express, the new problem area will likely be the TCP/IP stack.
Before you can move data with RDMA, you need to start with DMA TCP/IP, so it might be useful to review how we got to where we are today. RDMA TCP/IP was developed to address the performance problems of the TCP/IP stack running at high speeds, and it could not exist without the ability to use DMA between systems in the first place.
From the beginning, most TCP/IP implementations read the data and commands from the NIC into system memory and then processed them. In early UNIX implementations, they were read into allocated buffers called MBUFs. Commands and data were separated and processed by reading these MBUFs. This worked fine back when it was developed in the early 1980s, given the speeds of token ring and even 10BaseT.
Basically, three memory copies or page remaps were required for each TCP/IP packet: one copy to read the data and the command from the NIC into the MBUF (or, on more modern operating systems, into the IP streaming area), one read to process the command, and a write to move the data (or a remapping of a page) to the user. This requires three operating system interrupts along with the data moves, and creates significant overhead.
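Those three steps can be modeled with a toy sketch. This is illustrative only: the two-byte length header and buffer layout are assumptions made for the example, not how any real stack frames packets, and real kernels use DMA rings and mbufs/sk_buffs rather than Python byte strings.

```python
# Toy model of the classic three-step receive path, making the copies visible.
packet = b"\x00\x0c" + b"payload-data"   # assumed: 2-byte length header + data

# Step 1: copy the command and data from the "NIC" into a kernel buffer (MBUF)
mbuf = bytearray(packet)

# Step 2: read the command (here, just a length field) out of the MBUF
length = int.from_bytes(mbuf[:2], "big")

# Step 3: copy (or page-remap) the data from the MBUF into the user's buffer
user_buf = bytes(mbuf[2:2 + length])
```

Each step in the sketch corresponds to one of the copies described above; in a real kernel, each also carries interrupt and scheduling costs that the model cannot show.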
With the arrival of higher-speed network interfaces such as FDDI, 100BaseT, and especially HiPPI (High Performance Parallel Interface), the TCP/IP stack started to become a performance bottleneck.
Not surprisingly, the first company that experienced this problem was Cray Research in the late 1980s. Technology often gets its start in the HPC world before it gets commoditized and might or might not become commercially accepted.
The HiPPI channel is an example of something developed in HPC that never became commercially accepted, while a good example of a technology that did was Fibre Channel. There are good reasons for technologies winning and losing (see Storage Winners and Losers: Hedging Your Bets).
What Cray found was that the three interrupts and moving data around memory caused significant CPU overhead. The Cray implementation of UNIX was not multi-threaded, so with four or even eight CPUs, a single network interface running at full rate would lock down the operating system so that not much else could be done.
What the networking guys at Cray did was take the concept used for disk I/O, called direct I/O, and do network I/O directly to and from the user address space. Remember that the Cray system and its operating system ran in real memory, so nothing was paged, a design chosen for high performance. A read was done for the command and it was processed, then another read from the NIC moved the data from the network into user space. The MBUFs were no longer used for data, only for commands. At least that's how I remember DMA TCP data movement working. A few other vendors such as SGI developed DMA TCP/IP stacks, but the approach never really took off as a standard feature of UNIX operating systems.
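The spirit of that design survives in sockets APIs that read directly into a caller-supplied buffer instead of allocating a new one for every receive. A minimal sketch using Python's `socket.recv_into` (note the kernel still makes one copy here; true DMA into user space requires hardware support, which this example does not claim to demonstrate):

```python
import socket

# A connected pair of sockets stands in for a network link.
a, b = socket.socketpair()
a.sendall(b"command+data")

# The application owns the buffer; received bytes land directly in it,
# with no per-receive allocation on the user side.
user_buf = bytearray(64)
n = b.recv_into(user_buf)
print(bytes(user_buf[:n]))   # b'command+data'

a.close()
b.close()
```

The design point is the same one Cray made: let the receiver hand the stack a destination buffer up front, rather than having the stack allocate, fill, and then copy out.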
This need for DMA data movement grew over time with the advent of Linux clusters and PVM (Parallel Virtual Machine), and now MPI (Message Passing Interface). These software interfaces allowed users to write parallel applications that ran across the cluster, but communication over standard TCP/IP was slow even with specialized network interfaces, given the stack overhead. Companies such as Myricom, Dolphin, and Quadrics developed their own DMA hardware and stacks for Linux and MPI to allow high-speed, low-overhead message passing, so the DMA stack had really taken hold.
By the late 1990s, standards were being developed for technologies such as InfiniBand, which has RDMA built in as part of the standard, and even the Fibre Channel community developed VI to allow DMA transfers. Sadly, no one really used the VI standard for any broad-based applications. The use of RDMA during this time was limited to parallel applications on Linux clusters, but that is how the technology got its commercial start, and other vendors such as IBM and Compaq adopted these types of products during this period.
About 18 months ago, all this started to change with the commoditization of IB products from a variety of vendors. The need for greater performance had arrived.
So what about RDMA today? Clearly you cannot do RDMA TCP/IP without having a DMA TCP/IP stack or a NIC that supports TCP/IP offload (a TOE, or TCP/IP Offload Engine). With 10Gb Ethernet, you are not going to be able to run the same type of TCP/IP stack you have today at full rate without enormous operating system overhead, so some type of DMA stack is going to be necessary to move forward with high-speed networking on both ends; thus the need for RDMA TCP/IP.
We are not there yet, but I am confident that we will be doing RDMA TCP in the not-too-distant future, given the need for speed and the rise of new technologies. The big problem area that I see is security. If you write directly into the application's memory space, there are some basic security problems that need to be considered. Given the work that has gone into TCP/IP security over the last 15 years, we now have, on most operating systems, a highly secure stack. Adding RDMA does not inherently make the TCP/IP stack insecure, but just like any new function or feature, it will require a significant amount of testing before we can all relax and know we have not opened any new holes, such as the buffer overflow problems found in many TCP/IP stacks over the last 10 years.
I have been advocating higher-performance, lower-CPU, lower-latency TCP/IP stacks with many vendors since I started in the consulting business in the early 1990s. No company really saw the need until it hit the performance wall. We are finally going to make the jump to high-performance, low-latency, low-CPU TCP for the masses, or at least that's my hope. I am sure there will be bumps in the road, but RDMA is truly a technology whose time has come, and without it you are just going to buy more CPUs. Just ask any large server vendor what they recommend for CPUs for 1Gb or 10Gb Ethernet. Want to cut that overhead by 40% to 90% or more? Start using RDMA. Truly this is a cost-effective technology.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years' experience in high-performance computing and storage.