As Non-Volatile Memory Express (NVMe) storage performance increases, application design will have to change to take better advantage of the Input/Output Operations Per Second (IOPS) and bandwidth found in locally connected NVMe devices. This is a hypothesis I’ve been developing lately now that I’m semi-retired and have more time to think. If I’m right, it will have big implications for externally attached storage with high compute requirements, which we’ll get to in a moment.
I base this thinking in large part on the massive increase in Input/Output performance we have seen over the last 40 years (from 3 MB/sec to 13 GB/sec). That growth still does not match the increases in CPU or memory performance, of course, and networking increases largely fall in between storage and CPU/memory.
The upper end of current PCIe 5.0 Solid State Drive (SSD) performance is 13.5 GB per second for sequential reads and 11.8 GB per second for sequential writes. Some vendors are faster on reads and some are faster on writes, but all are pretty close, as they need to be to stay competitive. You can buy consumer M.2 versions of these SSDs for a few hundred dollars, and enterprise U.2 SSDs in some sizes for under $1,000.
NVMe devices are getting faster and faster as CPU vendors add more PCIe lanes, and server vendors are taking advantage of both the faster NVMe devices and the additional lanes to build servers with more I/O bandwidth. Network-connected storage using traditional protocols such as SAS and/or NFS is going to be a thing of the past for computing on data because of the increased performance available with NVMe connections; the old protocols could still be used for archiving.
Let’s take a look at connection and storage performance, and since PCIe 6, expected late this year or early next, doubles performance over PCIe 5, let’s add that to the mix too. Here’s the result; you can see that SSD performance will soon overwhelm 100 and 200 Gigabit Ethernet (GbE) connections:
| GbE | PCIe 5 read (13.5 GB/sec) | PCIe 5 write (11.8 GB/sec) | PCIe 6 read est. (26 GB/sec) | PCIe 6 write est. (22 GB/sec) |
| --- | --- | --- | --- | --- |
| 100 | 120.00% | 104.89% | 231.11% | 195.56% |
| 200 | 60.00% | 52.44% | 115.56% | 97.78% |
| 400 | 30.00% | 26.22% | 57.78% | 48.89% |

Each cell shows the SSD bandwidth as a percentage of a single port. Note: Assumes each port delivers 90% of its rated networking performance after packet overhead and checksums.
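For anyone who wants to reproduce or extend the table, here is a minimal sketch of the arithmetic, assuming (as the note says) that each port delivers 90% of line rate; the PCIe 6 figures are the rough doubling estimates used above, not measurements:

```python
# SSD bandwidth expressed as a percentage of one Ethernet port's usable bandwidth,
# assuming the port delivers 90% of line rate after packet overhead and checksums.

SSD_GBPS = {
    "PCIe 5 read": 13.5,
    "PCIe 5 write": 11.8,
    "PCIe 6 read (est.)": 26.0,
    "PCIe 6 write (est.)": 22.0,
}

EFFICIENCY = 0.90  # assumed usable fraction of line rate

for gbe in (100, 200, 400):
    usable_gb_per_sec = gbe / 8 * EFFICIENCY  # Gbit/sec -> GB/sec, minus overhead
    row = ", ".join(
        f"{name}: {ssd / usable_gb_per_sec:.2%}" for name, ssd in SSD_GBPS.items()
    )
    print(f"{gbe} GbE ({usable_gb_per_sec:.2f} GB/sec usable) -> {row}")
```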
Even today, 100 GbE is not fast enough to move data at the full rate of a single PCIe 5 SSD for either reading or writing. This will get way worse at PCIe 6.
Let’s say today you can buy a 100 GbE port in a server for $100 and an Ethernet port in a 20- to 30-port switch for $300. If you want to achieve the full bandwidth of each SSD, that roughly $400 of networking per device adds as much as 40% to the cost of a $1,000 enterprise SSD, and it is still going to be very hard to achieve the SSDs’ full IOPS given Ethernet’s latency. This situation will work for only so long.
The Application Problem
Enterprise and consumer external storage connected via networks has been around for many decades, from the old Intelligent Peripheral Interface (IPI), AT Attachment (ATA) and SCON protocols through Serial AT Attachment (SATA), Small Computer System Interface (SCSI), Fibre Channel, Serial Attached SCSI (SAS) and Network Attached Storage (NAS). External storage networking has always been challenged to match the performance of the storage itself. The first Fibre Channel storage array I worked on had Hard Disk Drives in it that together significantly exceeded the performance of its 1 Gbit per second Fibre Channel (FC) connection.
Storage networking vendors have almost always trailed the performance of the storage, and storage enclosures are generally even farther behind. Much of the reason is that designing a storage enclosure that achieves the necessary reliability, availability, performance, and manufacturability is both costly and time-consuming. It takes vendors years to develop these systems and years more to amortize their cost, all while storage performance keeps increasing.
The problem is that the time between generations of storage technology is compressing, and vendors do not have time to design a system, manufacture it and get cost recovery before the next generation is out. It looks like the time between PCIe generations is also compressing. Applications need to change to take advantage of this paradigm.
Take a standard application today running against a NAS system. You move the data from the NAS to the location where the application is running. That might be a workstation or a server, but neither of them has the network bandwidth to match the storage performance inside the NAS.
For a rough calculation, let’s say your NAS has 800 TB of usable storage and the enclosure can operate every device at full PCIe 5 rates. With 15 TB NVMe U.2 devices, that is 54 devices (I am not including any fault tolerance in this example). To run those devices at full rate, the NAS would need 729 GB per second of network bandwidth for reads and/or 637.2 GB per second for writes. That is roughly sixteen 400 GbE channels for the full read rate and/or fourteen 400 GbE channels for the full write rate; the sketch below walks through the arithmetic.
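Here is the same back-of-the-envelope calculation in code, assuming the per-device rates above and 400 GbE ports that deliver roughly 90% of line rate:

```python
import math

# Rough sizing for the 800 TB NAS example: 15 TB U.2 devices at full PCIe 5 speed,
# no fault tolerance, and 400 GbE ports assumed to deliver ~90% of line rate.

USABLE_TB = 800
DEVICE_TB = 15
READ_GBPS, WRITE_GBPS = 13.5, 11.8          # per-device PCIe 5 rates, GB/sec
PORT_GBPS = 400 / 8 * 0.90                  # usable GB/sec per 400 GbE port (~45)

devices = math.ceil(USABLE_TB / DEVICE_TB)  # 54 devices
read_bw = devices * READ_GBPS               # 729 GB/sec aggregate read
write_bw = devices * WRITE_GBPS             # 637.2 GB/sec aggregate write

print(f"{devices} devices")
print(f"read:  {read_bw:.1f} GB/sec ~ {read_bw / PORT_GBPS:.1f} x 400 GbE ports")
print(f"write: {write_bw:.1f} GB/sec ~ {write_bw / PORT_GBPS:.1f} x 400 GbE ports")
```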
That is a lot of network bandwidth out of the NAS server. No one can design a system of this magnitude and complexity and keep doubling performance every year and make money, and very few can afford the network bandwidth to move data at those rates to every application.
Change is Coming
There have been many decades of NAS application development in which data moves from the NAS to the client to be processed and/or written back out, and this is not going to change overnight. What needs to happen is for applications to process data close to where it is stored, meaning on a server with lots of NVMe-connected storage, and then ship only the results back to the client.
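As a toy illustration of that shift (not any particular vendor’s API; the file layout and function names here are hypothetical), the filtering runs on the server that holds the NVMe-attached data, and only the matching records cross the network:

```python
import json
from pathlib import Path
from typing import Iterator

DATA_DIR = Path("/nvme/data")  # hypothetical locally attached NVMe mount point

def scan_local_nvme() -> Iterator[dict]:
    """Stream records from local NVMe storage at device speed."""
    for path in DATA_DIR.glob("*.jsonl"):
        with path.open() as f:
            for line in f:
                yield json.loads(line)

def server_side_query(threshold: float) -> list[dict]:
    """Run the filter next to the data; return only matching records to the client."""
    return [rec for rec in scan_local_nvme() if rec.get("value", 0.0) > threshold]
```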
I know of a number of applications moving in this direction, but it is going to take some time. Many new applications are being developed this way given the industry’s move to the cloud, but the Network File System (NFS) protocol was developed in 1984, which means there are about 40 years’ worth of applications that need to be rewritten.
Think Globally, Compute Locally
Moving data around is expensive in so many ways, from the cost, complexity and maintenance of a high-speed network to the cost of latency and the cost of power to move data around the compute environment. Moving data is not a green technology, so compute where the data is!
As applications require greater and greater performance, and more performance becomes available with more NVMe lanes every few years, old models that use storage enclosures, external storage and old protocols cannot cost-effectively deliver a high percentage of the bandwidth and IOPS that NVMe devices provide. These external devices and protocols will move farther away from computational resources.
When will this happen? As a friend of mine said about his predictions, “I am almost always correct on what happens and always wrong on when I predict something will happen.”
Whether applications change so they factor problems to use local storage, or whether applications process on the server and just send results to the client, it is going to take time to rewrite applications. Is it two years, five years, or 10 years? It might be two years for some and 10 years for others, but the real issue as always is the capital it takes to continue innovation. At some point companies do not have enough profit to reinvest to innovate and integrate new technologies. It does not help the traditional IT companies that a significant percentage of the IT industry has moved to the cloud, where they develop their own technologies such as enclosures, NVMe devices and software stacks to use them.
I think the days of computing data on remote storage connected via traditional protocols are numbered. We will all be moving to local NVMe-connected storage for applications processing data and turning it into information. The only question is how long this process will take.