Over recent months, Enterprise Storage Forum has prepared a series of buying guides covering all aspects of storage. This one takes a somewhat different tack, providing advice from analysts on how storage managers should be addressing big data. It covers specific tools, proper planning and architectural considerations, as well as the evolving field of big data security.
“Big data, both large file sizes and a lot of smaller unstructured files, is the fastest growing segment of stored content,” said Thomas Coughlin, analyst with Coughlin Associates. “Cost-effective storage tiering, available metadata and content management are key elements in maintaining and gaining value from large data libraries.”
He passed along an important tip to those looking to invest in big data. Given the hype from much of the storage vendor community about brand new Hadoop capabilities, it would seem that every vendor can competently deal with any and all big data challenges. Unfortunately, that is far from the truth. Care must be taken to find the approach that integrates best with your existing environment.
“Storage managers need to find tools that are most effective for the types of big data that they are responsible [for] since one size does not, generally, fit all,” said Coughlin.
Further, big data does not just mean buying a big Hadoop data store. That may be only the beginning of the journey. Depending upon the frequency of use and what is considered to be an acceptable degree of latency for content access, appropriate storage tiering may be required. That could include flash memory, hard disk drives (HDD) and possibly even magnetic tape storage, added Coughlin. All of these products should ideally support file-based and object-based storage, particularly since many content libraries are accessed through the cloud. Deduplication, replication and erasure coding may also play a big role in large data retention. And of course, the point of big data is to unleash data analytic tools upon it to unlock hidden trends and competitive advantage. That may require extensive coordination with other business units and application owners.
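The tiering decision Coughlin describes, driven by frequency of use and acceptable latency, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Tier` class, the `choose_tier` function, the `hot_threshold` parameter and every latency and cost figure are hypothetical placeholders, not vendor specifications or anything taken from the analysts quoted here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    typical_latency_ms: float    # assumed order-of-magnitude figure, not a spec
    relative_cost_per_tb: float  # assumed relative cost, for illustration only


# Hypothetical tiers; real numbers depend entirely on the products chosen.
TIERS = [
    Tier("tape", 60_000.0, 1.0),
    Tier("hdd", 10.0, 4.0),
    Tier("flash", 0.2, 20.0),
]


def choose_tier(accesses_per_day: float, max_latency_ms: float,
                hot_threshold: float = 1000.0) -> Tier:
    """Pick a tier from access frequency and the acceptable latency."""
    fastest = min(TIERS, key=lambda t: t.typical_latency_ms)
    # Very frequently accessed ("hot") content goes straight to the fastest tier.
    if accesses_per_day >= hot_threshold:
        return fastest
    # Otherwise take the cheapest tier that still meets the latency bound.
    eligible = [t for t in TIERS if t.typical_latency_ms <= max_latency_ms]
    if not eligible:
        return fastest  # nothing meets the bound, so use the best available
    return min(eligible, key=lambda t: t.relative_cost_per_tb)


if __name__ == "__main__":
    # An archive touched rarely, where an hour's wait is acceptable.
    print(choose_tier(accesses_per_day=0.01, max_latency_ms=3_600_000).name)
    # Moderately used content that must come back within 100 ms.
    print(choose_tier(accesses_per_day=50, max_latency_ms=100).name)
```

A real policy would also weigh object size, retention rules and the cost of moving data between tiers, which this sketch ignores.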
“Depending upon the strength of the IT team, various approaches can be taken from the open source, build-your-own activities of companies like Facebook and Google to big-data-in-a-box-type approaches suiting conventional storage products,” said Coughlin.
A common fault in such a marketing climate is to rush into purchases. What is required at times like this is a cool head and a strategic direction.
“It's time for data center managers to seriously consider the approach they need to take: does it need to be all flash, or should they use caching and a smaller amount of flash storage, perhaps in the form of solid state drives (SSDs)?” asked Jim Handy, an analyst with Objective Analysis. “An all-flash strategy might be appealing because of its determinism, but if their needs grow exponentially, the cost to keep pace will soar.”
A cached approach, on the other hand, may not look as good due to concerns about lags in the event of a cache miss. But keep in mind that this is something systems already deal with every day, since the virtual memory systems in existing servers are as much based on probabilities of page hits and misses as are SSD caches. Handy advises users not to pay over the odds for performance they don’t really need, but not to skimp on the data sets and applications that matter. He mentioned companies like Pure, Nimbus, Violin and Skyera that focus their attention on delivering solid state storage at a price that can be lower than hard disk drive (HDD) based storage arrays, in the hope that users will simply pile their entire database into flash and leave it at that.
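The hit-and-miss trade-off can be made concrete with a little arithmetic. The sketch below uses the standard expected-value model for a cache (average latency is the hit rate times the cache latency plus the miss rate times the backing-store latency) alongside a deliberately simple cost comparison. All function names and all figures are assumptions chosen for illustration; in particular, the example assumes flash costs more per terabyte than HDD, which the vendors named above would dispute.

```python
def effective_latency_ms(hit_rate: float, cache_latency_ms: float,
                         backing_latency_ms: float) -> float:
    """Expected access latency for a cache in front of slower storage.

    Simplified model: a hit costs the cache latency and a miss costs the
    backing-store latency (any extra lookup overhead on a miss is ignored).
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cache_latency_ms + (1.0 - hit_rate) * backing_latency_ms


def all_flash_cost(capacity_tb: float, flash_cost_per_tb: float) -> float:
    """Cost of holding the whole data set on flash."""
    return capacity_tb * flash_cost_per_tb


def hybrid_cost(capacity_tb: float, cache_fraction: float,
                flash_cost_per_tb: float, hdd_cost_per_tb: float) -> float:
    """Cost of HDD for the whole data set plus a flash cache sized as a fraction of it."""
    return (capacity_tb * hdd_cost_per_tb
            + capacity_tb * cache_fraction * flash_cost_per_tb)


if __name__ == "__main__":
    # Assumed figures: 0.2 ms flash, 10 ms HDD, 95 percent hit rate.
    print(effective_latency_ms(0.95, 0.2, 10.0))
    # Assumed relative costs per TB: flash 20, HDD 4; cache sized at 10 percent.
    print(all_flash_cost(100, 20.0), hybrid_cost(100, 0.10, 20.0, 4.0))
```

Under these assumed numbers the cached design costs far less, but its average latency is set largely by the miss rate, which is why the hit rate a workload can actually achieve matters more than the headline speed of the flash.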
Don’t Forget Security
An integral part of any big data strategy is how to secure that growing stash of unstructured data. A recent study of organizations that experienced attacks found that 86 percent had evidence of the breach in their existing log data, but failed either to notice or to act upon that information. In addition, 92 percent of incidents were discovered by a third party, and 85 percent took weeks or more to discover.
“Remember the three V’s of Big Data: the capability to deal with data variety and velocity as well as volume,” said Scott Crawford, Managing Research Director at IT and data management industry analyst firm Enterprise Management Associates. “The distributed, parallel nature of environments like Hadoop support[s] greater efficiency and faster performance in executing analysis across larger bodies of data, through divide and conquer techniques such as MapReduce. They can make more subtle attacks more difficult to hide.”
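As a rough illustration of the divide and conquer idea Crawford mentions, the sketch below applies a map step and a reduce step to log data split into shards, counting failed logins per source address. It runs in a single Python process and is not Hadoop code; the log format (`<timestamp> <source> <event>`), the `LOGIN_FAILED` event name and the function names are all hypothetical.

```python
from collections import Counter
from functools import reduce


def map_shard(lines):
    """Map step: count failed logins per source within one shard of log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Hypothetical log format: "<timestamp> <source> <event>"
        if len(parts) == 3 and parts[2] == "LOGIN_FAILED":
            counts[parts[1]] += 1
    return counts


def reduce_counts(left, right):
    """Reduce step: merge the partial counts from two shards."""
    return left + right


def failed_logins(shards):
    """Run the map step on every shard, then fold the partial results together."""
    return reduce(reduce_counts, map(map_shard, shards), Counter())


if __name__ == "__main__":
    shards = [
        ["t1 10.0.0.5 LOGIN_FAILED", "t2 10.0.0.7 LOGIN_OK"],
        ["t3 10.0.0.5 LOGIN_FAILED", "t4 10.0.0.9 LOGIN_FAILED"],
    ]
    print(failed_logins(shards).most_common())
```

Because each shard is processed independently, the map step is the part a framework such as Hadoop can spread across many nodes; only the much smaller partial counts need to be brought together at the end.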
Crawford pointed out that the open source distribution of Hadoop does not include many security capabilities, though commercial distributions such as Shadoop can often add features such as access control, audit logging and authentication. In addition, IBM InfoSphere Guardium has introduced tools for securing big data environments. Disk encryption is another recommended action. But remember that big data is a recent phenomenon, and the security community has yet to catch up with the capacity to store it. As mentioned earlier, careful planning and due diligence regarding the needs of your own specific environment apply just as much to securing big data as to storing it.
“Developing strategies for aligning the need to protect data, control access and monitor activity that map well to the highly distributed nature of emerging Big Data environments — without interfering with their value — is still an evolving field,” said Crawford.