If you don't plan carefully, big data initiatives can quickly overwhelm your storage infrastructure.
Big data applications are growing as organizations mine their data for insights about clients, suppliers and operations. But as capacities grow and data becomes more sensitive, the underlying storage remains an important consideration.
Here are ten tips on how data storage professionals can stay on top of the big data deluge that threatens to overwhelm their systems.
Flash technology and new storage system designs that include compression and deduplication have become a vital necessity in a big data world, said Clodoaldo Barrera, chief technical strategist, IBM Storage.
“As the business begins to rely on big data insights, the big data apps become mission critical,” he said. “Backup, archive and disaster recovery must also be added to the operational requirements.”
Whenever a new workload appears, it is tempting to treat it as a new type of computing, requiring new storage infrastructure. The usual argument is "a new type of storage is a better fit for this new workload," usually citing lower cost or better performance. The problem with this approach is that it creates separate islands of storage and data for each application type, said Barrera. Each island must have its own management, security, business continuity, upgrade path etc., and requires its own planning and operations management. Worse yet, separate islands inhibit the mobility of data between workloads; transaction processing, real-time analytics and big data applications need to operate against a common base of data.
“In preparing a big data environment, give thought to the needs and costs of the overall storage infrastructure and carefully consider how many different data and storage environments are really necessary,” said Barrera.
Speaking of silos, a vital first step for many is to consolidate their big data storage environment and thereby eliminate the various data silos that exist across their organizations. This is important for two reasons. First, it is difficult to apply big data tools effectively across disparate data pools. Second, a consolidated data storage environment is generally more efficient and easier to manage. To enable this approach, the IT infrastructure needs to be able to support a wide range of applications and workloads on a single storage platform.
“Data consolidation can help organizations reduce costs, simplify IT management and set the stage for efficient use of unstructured data analytics tools to extract more value from data assets,” said Varun Chhabra, senior director of product marketing, Dell EMC Unstructured Data Storage. “Because many organizations use a wide range of applications and workloads to support their businesses, it is important to select a storage infrastructure that has multi-protocol support capabilities that can provide significant operational flexibility.”
There are a lot of big data storage tools out there. None are right for every application. Select carefully to match your own application and your own environment.
“Don't assume just because a solution says big data and analytics support that it will work for your application,” said Greg Schulz, an analyst with StorageIO Group. “If you are doing Hadoop, get something optimized for that, or video processing, get something optimized for that. Look beyond the buzzword check box.”
Forty-two percent of all data will qualify as “machine generated” by 2020, according to IDC. This data is generated almost constantly and in copious amounts, in forms such as application logs, sensor data, business process logs and message queues, and it holds a potential gold mine for CIOs and business leaders. To keep up with data growth and monetize its opportunities, companies need the right people and the right tools. But unlocking the potential of machine learning entails correlating and mathematically analyzing massive data sets. Thus, careful planning of the underlying storage architecture is essential.
“Today's big data initiatives involve a ton of data and a ton of infrastructure,” said Laz Vekiarides, CTO of ClearSky Data. “Be prepared.”
Most big data projects are undersized in terms of performance and capacity from the outset, added Vekiarides. Initial estimates of how big the big data might get are often laughable within a year or two. This is largely because the value of these projects to the organization is under-scoped. Therefore, a growth plan is a requirement from the very beginning.
“Seek out consumption-based models that allow you to grow on-demand without having to pay for unused capacity, software and infrastructure,” said Vekiarides. “Elasticity matters most when data sizes are growing quickly and require rapid access — both of which are true in big data and analytics.”
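As a rough illustration of why early estimates age so badly, the sketch below projects capacity under compound annual growth. The starting size (100 TB) and growth rate (50 percent a year) are hypothetical figures chosen for the example, not numbers from Vekiarides or any vendor.

```python
def project_capacity(initial_tb, annual_growth_rate, years):
    """Return projected capacity in TB for year 0 through `years`,
    assuming the data set grows by a fixed percentage each year."""
    return [initial_tb * (1 + annual_growth_rate) ** year
            for year in range(years + 1)]


# Hypothetical inputs: 100 TB today, growing 50 percent a year.
projection = project_capacity(initial_tb=100, annual_growth_rate=0.5, years=3)

for year, tb in enumerate(projection):
    print(f"Year {year}: {tb:.1f} TB")
```

Under these assumed inputs, the data set more than triples in three years, which is the kind of gap a growth plan or a consumption-based model is meant to absorb.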
Once a peta-scale data set is created, it is very difficult to protect comprehensively after the fact. What sometimes happens is that unwieldy data sets are created across multiple platforms without any real thought to how to secure the data. But then the realization sets in that a single bad accident could result in the loss of incalculably valuable data. Alternatively, data can become stranded in a public cloud when the tool to analyze it sits elsewhere, either in another public cloud or on-premises.
“Think about disaster recovery and security in advance as this data will soon become a strategic asset,” said Vekiarides. “Look at how broadly you want to use it as well as how you can ensure it is secure and protected.”
Not all unstructured data has the same value, and its value often changes over time. Data used in applications and workloads that demand a high-performance infrastructure will require high-performance storage resources (e.g. all-flash). Other data, such as older, little-used data, may be archived and will not require high performance. Using the same type of storage system for all data will generally result in inadequate levels of performance. Using a storage system with an automated, policy-based tiering capability can ensure that data is supported with the correct level of performance.
“This approach will optimize storage resource investment and eliminate costly movement of data manually,” said Chhabra.
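To make the idea of policy-based tiering concrete, here is a minimal sketch of an age-based placement rule. The tier names and the 30-day and 180-day thresholds are assumptions made up for illustration; commercial storage systems apply their own policies and typically weigh more than last-access time.

```python
from datetime import datetime, timedelta


def choose_tier(last_accessed, now, hot_days=30, warm_days=180):
    """Pick a storage tier from how recently the data was accessed.

    Tier names and thresholds are hypothetical, for illustration only.
    """
    age = now - last_accessed
    if age <= timedelta(days=hot_days):
        return "all-flash"   # recently used: highest performance
    if age <= timedelta(days=warm_days):
        return "hybrid"      # occasionally used: mid-range storage
    return "archive"         # old, little-used data: lowest cost


# Example: evaluate three data sets against a fixed reference date.
now = datetime(2024, 1, 1)
for name, last_accessed in [
    ("clickstream", datetime(2023, 12, 20)),
    ("q3_reports", datetime(2023, 9, 1)),
    ("2019_logs", datetime(2020, 1, 15)),
]:
    print(name, "->", choose_tier(last_accessed, now))
```

A rule like this, run on a schedule, is what removes the manual data movement Chhabra describes.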