In the past, data storage was kind of dumb. It sat there inert – waiting for an application to come along and do something with it.
Those days are gone, as big data and analytics tools seek to unearth trends, isolate opportunities and detect threats in real time.
Here are some tips from the experts on how to get the most out of the evolving relationship between storage and analytics.
1. Define Needs
There is so much hype out there. Enterprises can be tempted to blindly deploy Hadoop, SAS, IBM Watson and an endless list of analytics tools to make their storage sing a sweet and insightful tune.
Those tools may or may not fit into a specific environment. Better to figure out your needs and lay out the most applicable platforms before the selection process begins. “Define your must-haves and desirables clearly before evaluating solutions,” said Anoop Dawar, vice president of product marketing and management at MapR.
2. V6 Turbo
Car aficionados purr when they discuss the merits of V6 or V8 engines. Dawar prefers to discuss the 6 Vs of big data and analytics. Storage professionals should pay close attention to each one with regard to the analytics tools they intend to purchase.
“Consider velocity (speed of data processing), variety (number of types of data), volume (amount of data), vicinity (location awareness), visibility (global namespace) and veracity (strong consistency and data protection),” he said. “And can the solution scale, perform under heavy volume and under fluctuating peak and off-peak traffic?”
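One way to make the must-haves, desirables and six Vs concrete is a simple weighted scorecard. The sketch below is only an illustration: the 0-5 rating scale, the weights, the `Candidate` class and the feature names are all assumptions, not part of any vendor's method.

```python
from dataclasses import dataclass, field

# The six Vs named in the text; the 0-5 rating scale is an assumption.
SIX_VS = ("velocity", "variety", "volume", "vicinity", "visibility", "veracity")


@dataclass
class Candidate:
    """A hypothetical platform under evaluation."""
    name: str
    ratings: dict  # maps each V to a 0-5 rating from your own evaluation
    features: set = field(default_factory=set)


def score(candidate, weights, must_haves):
    """Return the weighted average rating, or None if a must-have is missing."""
    if not must_haves <= candidate.features:
        return None
    total_weight = sum(weights[v] for v in SIX_VS)
    weighted = sum(weights[v] * candidate.ratings.get(v, 0) for v in SIX_VS)
    return weighted / total_weight


def rank(candidates, weights, must_haves):
    """Rank the candidates that satisfy every must-have, best first."""
    scored = [(score(c, weights, must_haves), c.name) for c in candidates]
    qualified = [(s, n) for s, n in scored if s is not None]
    return sorted(qualified, reverse=True)
```

Must-haves act as a hard filter, while the weights express the desirables: a team that cares most about velocity would give it a larger weight than the other five Vs before calling `rank`.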
3. No Afterthoughts
The initial intent of a system is built into its core. Everything else is layered on top. Hence, the oft-repeated caution to opt for systems designed for a particular purpose. This is the ace card played by startups when criticizing older systems. The old brigade is usually late to the party and their initial efforts to add the latest and greatest functionality are often clunky. But eventually they get it right, whether through acquisition of the startups or by in-house development.
In the case of analytics, the relative newness of the subject in data storage means the newer platforms have the edge. Legacy solutions offered by traditional storage companies may not yet have the scale or cost efficiencies needed to address big data and analytics challenges. Cloud storage, for instance, while being relatively cost effective in some cases, may have higher latency and can be expensive when data needs to be moved. Big data solutions such as Hadoop or NoSQL databases sometimes have reliability and scalability challenges, which means they may not do well as “systems of record” with dependable SLAs for mission-critical business use.
“Make sure the solution was built ground up with analytics in mind and not added into the solution as an afterthought,” said Dawar. “At the same time, make sure it can host both legacy and new age applications.”
4. File Analytics
It may not be wise or even necessary to try to incorporate complex IBM Watson-like analytics into data storage systems. But storage companies are finding ways to piggyback simpler analytics and big data features onto their storage platforms. File system analytics represent a good example.
“Storage vendors are adding file system analytics to their file management solutions,” said Cuong Le, senior vice president field operations, Data Dynamics. “File analytics integrated with file movement capabilities provide a more robust and automated file lifecycle management solution.”
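As a rough idea of what basic file system analytics can look like, the sketch below walks a directory tree and totals files into "hot" and "cold" buckets by last-modified time. It is not any vendor's implementation; the function name and the 365-day threshold are assumptions chosen for illustration.

```python
import os
import time
from collections import defaultdict


def summarize_by_age(root, cold_after_days=365, now=None):
    """Walk `root` and total file counts and bytes into 'hot' and 'cold' buckets.

    A file counts as cold when its last-modified time is older than
    `cold_after_days`. The threshold is an illustrative assumption.
    """
    now = time.time() if now is None else now
    cutoff = now - cold_after_days * 86400
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            bucket = "cold" if info.st_mtime < cutoff else "hot"
            summary[bucket]["files"] += 1
            summary[bucket]["bytes"] += info.st_size
    return dict(summary)
```

A report like this is the kind of input a file movement policy could act on, for example by flagging the cold bucket as a candidate for a cheaper tier.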
5. Amazon S3 Assistance
Many talk about the difficulties in moving data. Large databases and information silos may be a goldmine. But analytics and business intelligence (BI) applications may not be able to get to them. For those with an existing cloud presence such as Amazon S3, there may be relatively simple ways to incorporate that data with cloud-based analytics tools.
“With regards to business intelligence, users need help moving existing data from distributed data silos to storage repositories that the BI application can operate upon,” said Le. “When centrally placed in S3 repositories, those files can be analyzed by a growing number of cloud-based BI applications.”
6. Pay Attention to File Access
File management can involve users in a labyrinth of gateways, virtual file systems, global namespaces and file stubs. If these are proprietary technologies, that may involve annual license fees to access data. Such systems can also become a bottleneck, introduce increased complexity or add risk.
“Application Programming Interfaces (APIs) are particularly important when modernizing existing legacy applications to access files (whether on SMB, NFS or S3) and get those files where they need it, when they need it, in the form they need it,” said Le. “Avoid vendor lock-in so you have the freedom to move your files and data.”
7. New Storage Controllers
Storage controllers have traditionally been the gatekeepers between storage, compute and applications. But the newest crop sometimes comes with built-in analytics.
“Advanced storage controllers have been developed to provide high performance for storage functions key to analytic workloads, including compression, RAID and encryption,” said Clodoaldo Barrera, chief technical strategist, IBM Storage.
8. NVMe
NVMe brings the speed of data transport mechanisms closer to the velocity of modern processors and flash architectures. Products already on the market that incorporate it run up to six times faster when it comes to random and sequential read/write performance. Although we are still in the early stages of its evolution in storage, it is fast becoming an industry standard, and it is being added to switched fabric. Developers are figuring out ways to utilize it throughout the entire storage array stack. Thus, you will soon see NVMe in frontend, internal and backend applications.
“The quest for performance has led to innovation across the entire storage data path; the industry has developed a low latency data protocol, NVMe, to replace SCSI, and will shortly see the delivery of NVMe over a switched fabric,” said Barrera. “Analytic applications are becoming mission critical, and there is a need for the business continuity functions that storage provides for database environments, including snapshots and synchronous and asynchronous data replication.”
9. Beware Further Silos
For decades now, cautions about erecting silos have gone unheeded. As soon as a new tool or platform comes along, yet another silo gets set up. Unfortunately, the temptation is strong to create multiple storage environments for analytics applications. Many of these depend on high performance or are introduced by data scientists who want to manage their own data, including storage, said Barrera. That may be OK until the application becomes mission critical.
“Most app owners are not able to provide their own business continuity, security and privacy facilities,” he said. “Astute IT teams will get out in front of application demand by identifying the analytics tools desired, and providing a managed storage environment that complies with enterprise needs but still allows data scientists the experience they want.”
10. Real-Time or Not?
Everyone wants speed. Everyone demands the best possible performance. But not every application needs it. And few can afford to add high-end systems everywhere indiscriminately. It’s the same with analytics. Most would want real-time analytics. Banks may certainly need it, and so might huge e-commerce sites. But not everyone by a long shot.
“Storage systems are designed to provide optimized solutions for specific application workloads and use cases,” said Paul Speciale, vice president of product management at Scality. “As such, it’s important to design big data solutions appropriately for real-time analytics access patterns.”
The Hadoop Distributed File System (HDFS) is often best for real-time analysis of unstructured data. However, when you are dealing with longer-term archival storage, an object storage solution such as the Scality RING may be more efficient. Aligning the right type of storage with the use case and workload tier is the proper approach for big data analytics.
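The idea of matching storage to workload can be written down as a small decision rule. The sketch below is deliberately simplistic and rests on assumptions: the `Workload` fields, the 365-day archive threshold and the "general-purpose" fallback tier are hypothetical, and a real decision would weigh cost, latency and data protection as well.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """A hypothetical description of an analytics workload."""
    name: str
    needs_real_time: bool
    retention_days: int


def choose_tier(workload, archive_after_days=365):
    """Pick a storage tier using the two options the text contrasts.

    Real-time analysis goes to HDFS, long-term archives go to object
    storage, and everything else falls back to an assumed default tier.
    """
    if workload.needs_real_time:
        return "hdfs"
    if workload.retention_days >= archive_after_days:
        return "object"
    return "general-purpose"
```

For instance, a fraud-detection feed marked `needs_real_time=True` would land on HDFS, while compliance logs kept for several years would be steered to object storage.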