Big Data and In-Storage Processing
If storage were free, where would you put it? The obvious answer is "as close to the processors as possible."
But when you're talking about big data, it makes more sense to pose the question in a slightly different way: if all processing were free, where would you put it? The answer is "as close to the storage as possible."
And that's exactly what in-storage processing attempts to do in big data scenarios. Instead of moving terabytes of data from the storage systems to the processors, it runs applications on processors in the storage controller.
Of course, processing power isn't actually completely free, but its price has certainly fallen dramatically. What's more, the trend is towards storage systems which do away with the need for custom ASICs. Instead, they consist of storage software running on nothing more than conventional, industry-standard servers. These servers have formidably powerful processors which can do far more than run the storage software.
"What storage vendors are increasingly saying is that they have spare capacity for processing on their storage servers," says Mark Peters, a senior analyst at Enterprise Strategy Group (ESG). The obvious thing to do is use that processing capacity for something other than storage—like running applications in the storage system. "I think that this is storage vendors being pragmatic, and suggesting the processing resources be used more fully," he says.
The prudent way to do this is by running a limited number of virtual machines in the storage servers and allowing these virtual machines to run suitable applications. That raises the question: what sort of applications are best suited to being run in this manner?
Relatively simple applications work best, according to Jeff Denworth, marketing VP at big data storage vendor DataDirect Networks, a company that offers in-storage processing in its storage systems. "The best sort of applications in this environment are ones that run pre-processing or post-processing algorithms on data, analyzing data, filtering data or applying metadata," Denworth explains. "But you have to remember that this is not a replacement for a supercomputer, as there is not a wild amount of computing power in a storage system," he adds.
The applications also need to run on an operating system supported by the in-storage hypervisor—typically Linux or Windows. (DDN's system used a modified KVM virtualization system to host virtual machines, with the I/O infrastructure modified to present the app with a collection of memory pointers.) And clearly the applications can't rely on GPU acceleration because there is no powerful graphics subsystem on a storage device.
In-Storage Processing Examples
As it happens, pre-processing and post-processing algorithms are just the sort of applications that are typically required in big data environments.
For example, the International Centre for Radio Astronomy Research (ICRAR) generates a million terabytes of data (about an exabyte) every day from its Square Kilometre Array telescope. That's a phenomenal amount of data, but only a tiny fraction of it is interesting and needs to be retained; the rest is useless "noise" that can be discarded. The tricky part is analyzing this data and separating the interesting signal from the noise. To do this, ICRAR stores the incoming data in a DDN storage system and runs data reduction algorithms in a virtual machine embedded in the storage system, using the storage system's processing resources.
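To make the idea of data reduction concrete, here is a minimal sketch of a streaming noise filter of the kind that could run next to the data. It is not ICRAR's or DDN's actual algorithm: the function names, the simple magnitude threshold, and the sample values are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, List


def filter_noise(chunks: Iterable[List[float]], threshold: float) -> Iterator[List[float]]:
    """Yield, chunk by chunk, only the samples whose magnitude reaches the threshold.

    Hypothetical reduction rule: anything below `threshold` is treated as noise.
    Chunks that contain nothing but noise are dropped entirely.
    """
    for chunk in chunks:
        kept = [sample for sample in chunk if abs(sample) >= threshold]
        if kept:
            yield kept


def retained_fraction(chunks: Iterable[List[float]], threshold: float) -> float:
    """Return the fraction of all samples that survive the filter."""
    total = 0
    kept = 0
    for chunk in chunks:
        total += len(chunk)
        kept += sum(1 for sample in chunk if abs(sample) >= threshold)
    return kept / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample data: three chunks read from storage.
    incoming = [[0.1, -0.2, 5.0], [0.05, 0.3], [-7.5, 0.0, 0.2, 6.1]]
    print(list(filter_noise(incoming, threshold=1.0)))  # [[5.0], [-7.5, 6.1]]
    print(retained_fraction(incoming, threshold=1.0))   # 3 of 9 samples kept
```

Because the filter processes one chunk at a time, only the retained samples ever need to leave the storage system, which is the point of running it there.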
And at the Department of Energy, supercomputers generate tens of petabytes of raw data running climate simulations and other mathematical models. That data feeds its storage systems at a rate of about 100 gigabytes per second. Dr Rob Ross, a storage researcher in the DOE SciDAC Enabling Technology Center for Scientific Data Management, says that the benefits of analyzing this data in the storage system are reduced costs and increased speed.
"There's a limit to how fast you can move your data to a computer from storage for a start," he says. Processing the data in place removes the networking element: the data no longer has to travel through a host bus adapter to a switch and on to a server for processing, which cuts overhead and, with fewer hops, lowers latency. The apps run in the same memory address space as the storage system cache.
"Then there's the cost of powering and running a network to move the data, and the cost of waiting around for the data to be moved. Carrying out in-storage processing is just a much smarter way of doing things," says Ross.
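A rough back-of-the-envelope calculation shows why moving the data is the bottleneck. The sketch below assumes decimal units, takes 10 petabytes as the low end of "tens of petabytes," and treats 100 gigabytes per second as a hypothetical sustained link speed between storage and compute; none of these figures describe a measured DOE network.

```python
PB = 10**15  # bytes in a petabyte (decimal units, an assumption)
GB = 10**9   # bytes in a gigabyte (decimal units, an assumption)


def transfer_seconds(data_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Return the ideal time to move `data_bytes` at a sustained bandwidth.

    Ignores protocol overhead, contention, and latency, so real transfers
    would take longer than this lower bound.
    """
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    seconds = transfer_seconds(10 * PB, 100 * GB)
    print(seconds)                    # 100000.0 seconds
    print(round(seconds / 3600, 2))   # 27.78 hours
```

Even under these generous assumptions, shipping the data set to a separate compute cluster takes more than a day, which is time in-storage processing does not have to spend.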
Poised for Takeoff
In-storage processing has actually been around for some time—DDN introduced it into its storage devices back in 2009—but it's fair to say that it has not taken off in a huge way yet. For example, Denworth says only about 10% of DDN's customers currently use the technology.
One reason for this may be reluctance on the part of the larger vendors to introduce the technology. ESG's Mark Peters explains, "All the companies that are doing in-storage processing are smaller companies." Aside from DDN, other companies involved in the space include Pivot3 and Scale Computing. "I don't think that the bigger companies want people to understand that their storage may be running on a standard x86 server," he says.
But it's also true that big data is only just coming of age, and as massive data stores become increasingly prevalent in the enterprise, more and more vendors could embrace the technology. That's certainly Peters' view. "I think that, in the future, carrying out processing in the storage system will definitely become more standard," he concludes.