Big data is one of those amorphous terms–no doubt some vendors will move from 500 GB to 1 TB drives and call that big data.
“Big data has no particular definition for IT; the only thing consistent about what people mean when they talk about it is that any way you look at it, there’s a lot of it,” said Mike Karp, an analyst with Ptak, Noel and Associates. “Or, to steal a phrase from author Douglas Adams, like space, ‘it is really, really big.'”
The most important feature of big data is that traditional data manipulation tools and storage management techniques can’t handle it adequately. In highly competitive industry segments, therefore, how all this data is turned into leverageable intellectual property will become a key differentiator between industry leaders and those firms that will be seen as laggards.
“Analytics–and also, people with analytical skills–will be key drivers of the world economy going forward,” said Karp.
With that in mind, let’s look at some interesting implementations for big data.
Karp calls IBM an industry leader in big data. He said the company is conducting fundamental research as well as product development in this area.
“The Watson program, which famously won on the Jeopardy challenge several months ago, has big data as one of the many elements of what makes Watson, Watson,” said Karp. “It will be interesting to see the ways in which IBM pulls what they have learned from developing Watson into their product lines.”
Karp noted that CA, too, may have something going in this department.
“CA also looks to be doing some work in the big data arena, although much of that is under wraps and is unlikely to see the open market for another quarter or two,” he said. “What they are doing certainly bears watching.”
There are two main routes you can travel if you want to analyze content, irrespective of whether that content is structured or unstructured: proprietary analytics tools (IBM, CA and many others) and open-source tools. The latter increasingly means Hadoop, a project of the Apache open source community.
“In the area of open source, many companies rely on Hadoop to provide the fundamental analytical tool for doing analysis associated with clustered, high-performance systems,” said Karp.
EMC is another of the big boys to quickly recognize potential in this sector. It acquired analytics startup Greenplum more than a year ago. Karp noted that Greenplum is now seriously looking at developing two levels of Hadoop code: one that is interoperable with the open-source version that comes from the Hadoop community, and a second, “enterprise-level” product that can be thought of as a proprietary extension of the open community Hadoop.
When it comes to big block bandwidth, NetApp looks to be in good shape with its recent acquisition of Engenio from LSI–the company is marketing it as its E-Series.
“This has done well in throughput performance type applications as a block device or attached behind NAS and object based clusters,” said Greg Schulz, an analyst for StorageIO Group.
Some big data requirements may be served well with parallel NFS (pNFS), which enables high-speed data movement between machines. It represents the standardization of parallel I/O and allows clients to access storage devices directly and in parallel. This removes the single NFS server from the data path, where it is a common source of scalability and performance problems.
pNFS lets you do several things. You can, for example, stripe a single file across multiple NFS servers, which is essentially the same as RAID0. While RAID0 boosts performance by allowing multiple disk drives to serve up data in parallel, pNFS extends this to multiple storage devices connected to the NFS client via a network.
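To make the RAID0 comparison concrete, here is a minimal Python sketch of round-robin striping arithmetic. It is not the pNFS protocol or any vendor's layout. The function names (`locate`, `split_read`), the stripe size and the server count are all hypothetical, chosen only to show how one logical read can be broken into pieces that different storage devices could serve in parallel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeLocation:
    server: int          # index of the (hypothetical) data server holding the byte
    server_offset: int   # byte offset within that server's share of the file


def locate(offset: int, stripe_size: int, num_servers: int) -> StripeLocation:
    """Map a logical file offset to a server and an offset on that server.

    Assumes simple RAID0-style round-robin striping, with each server
    storing its stripes packed one after another.
    """
    if offset < 0 or stripe_size <= 0 or num_servers <= 0:
        raise ValueError("offset must be >= 0; stripe_size and num_servers must be > 0")
    stripe_index = offset // stripe_size          # which stripe of the file
    server = stripe_index % num_servers           # round-robin across servers
    local_stripe = stripe_index // num_servers    # how many stripes precede it on that server
    return StripeLocation(server, local_stripe * stripe_size + offset % stripe_size)


def split_read(offset: int, length: int, stripe_size: int, num_servers: int):
    """Break one logical read into (server, server_offset, nbytes) chunks."""
    chunks = []
    end = offset + length
    while offset < end:
        loc = locate(offset, stripe_size, num_servers)
        # Read to the end of the current stripe or the end of the request,
        # whichever comes first.
        nbytes = min(stripe_size - offset % stripe_size, end - offset)
        chunks.append((loc.server, loc.server_offset, nbytes))
        offset += nbytes
    return chunks


if __name__ == "__main__":
    # A 100-byte read starting at offset 32, with 64-byte stripes on 4 servers.
    for chunk in split_read(32, 100, stripe_size=64, num_servers=4):
        print(chunk)
```

Under these assumptions, a single request touches three servers, each of which could return its chunk at the same time. A real deployment would use far larger stripes, and the actual layout is negotiated between the client and the metadata server rather than computed this way.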
“If using NAS file sharing and NFS, look into pNFS if your needs are for parallel sequential streaming of large files,” said Schulz.
Don’t Be Bullied
Schulz cautioned that there are many different use cases for big data. Therefore, organizations should not rush into the latest crop of big data apps. For more focused application analysis and processing requirements, he said, there are specialty storage solutions, such as HP Vertica and IBM Netezza, in addition to many high-performance NAS or object systems. In certain cases, this may be all you need to adequately deal with a particular brand of big data.
Similarly, for video, security surveillance, CCTV, simulation and other large-bandwidth or high-throughput workloads, there are solutions like IBM SONAS, HP IBRIX, Dell Exanet (e.g., FS7500), BlueArc, and several other offerings from HDS, NetApp, DataDirect Networks, Oracle 7000, and EMC Isilon and VNX.
Big databases are another area where brand new big data apps may not be needed–or where the solution should be one that most closely dovetails with a particular architecture.
“For database-centric big data, there are solutions from Teradata as well as Oracle, including their Exadata II systems,” said Schulz.
As a final note, get ready for a whole lot of hype telling you to move to this, that or another more expensive system. What you have already may be good enough–if it can be scaled. And what you are being offered may not really work well in your environment.
“Watch out for big data bullies who may want to narrow your options and thinking to what their specific view and product does,” said Schulz. “There are many different aspects, characterizations, applications, use cases and deployment scenarios, not to mention opportunities, in the big data field.”
Drew Robb is a freelance writer specializing in technology and engineering. Currently living in California, he is originally from Scotland, where he received a degree in geology and geography from the University of Strathclyde. He is the author of Server Disk Management in a Windows Environment (CRC Press).