Big Data Storage Takes a Data Lake Swim
As a key point about big data storage, how’s this for a Captain Obvious statement: data is getting bigger. Storage vendors have swung into action to make their systems more scalable, aggregated, faster. This is particularly true for the biggest big data of them all, massive amounts of information whose competitive value requires large-scale analytics.
The “3 V’s” of data storage govern the big data storage arena: Volume, Velocity, and Variety. Big volume is a given – big data storage must have sufficient capacity to store never-ending data growth.
Velocity is the measure of how fast a storage system can ingest and process massive amounts of incoming data. Variety describes mixed data types and file sizes, which in big data may differ radically depending on its source: machine sensors, laboratory experiments, cyber archaeology, weather tracking, medical experiments, documents, logs, files, email, clickstreams and more.
In order to make sense of all this raw data – and to preserve processing speeds on PB-sized data volumes – analytics applications like Hadoop provide computing architecture and analytics. (Although Hadoop is not the only big data analytics storage game in town, it is the market leader for end-user adoption and developer distributions.) Traditionally Hadoop runs on its own commodity hardware platform and depends on direct attached storage (DAS) to preserve performance when analyzing multi-petabyte data stores.
But here’s the problem: big data with analytics value can exist on multiple sources including SAN, NAS, or Hadoop distributions. Would-be analysts are not altogether sure where their data resides. The answer is that it resides all over the place in disparate application storage: in a NAS serving the enterprise CRM, or a customer order database storing to a web farm’s SAN, or in remote point-of-sale machine data living in hundreds of remote locations.
IT must locate the data that the analysts need and copy that data into the Hadoop clusters. Hadoop takes in the data for processing, delivers results, re-processes in response to additional queries, and delivers results again. Even though Hadoop clusters are engineered for massive data processing, locating and moving data from disparate big data storage into the Hadoop clusters demands time and resources.
Big Data Storage: Enter the Data Lake
Similar in concept to a storage pool, a data lake is a scalable storage environment that is purpose-built for analyzing massive amounts of data. Traditionally this is low-level or raw data but Hadoop is also capable of analyzing structured data and streaming or near-realtime data. The data lake supports multiple distributions in a single cluster environment: a logical set of storage units decouples Hadoop from DAS and exposes externalized storage as a local HDFS cluster.
This technology enables companies that run Hadoop to dispense with analytics silos, or “puddles” as some cleverly call them. It’s not unusual for large Hadoop users to deploy distinct Hadoop silos for separate business units and/or Hadoop distributions. For example, a financial services firm may employ distributions from Hortonworks, Cloudera, and MapR depending on its analytics needs. It must move common data between the distributions, just as it takes time to import data from applications.
In contrast, the data lake big data storage model houses multiple distributions and workloads of vastly different sizes and types, enabling analysts to centralize massive analytical data from different applications in the same logical pool. Data lakes can contain databases, files, audio, video, and general file serving as well as big data types subject to analytics: video, audio, structured and unstructured data, real-time data, machine data, and more.
There is a lot more to do to make this work, and we will discuss those challenges in a moment. But in general, data lakes are an attractive option for solving both the time and uncertainty issues when analyzing big data storage.
Keep in mind that the data lake is not a Shangri-La notion of all enterprise data in a single virtualized pool. Someday we may have the technology to build single repositories to store all enterprise data. We’re already on the way by combining virtualization, common data layers, and software-defined storage. But in reality, today’s analytical data lake doesn’t go that far. It does however go far enough to offer big benefits to analytics big data storage.
Why Does Hadoop even Need a Data Lake?
If all that a company needs are Hadoop distributions that someone else takes care of, they can employ Hadoop-as-a-Service (HaaS). Many cloud providers offer this managed service including Rackspace, Microsoft Azure, IBM BigInsights on Cloud, Google’s Cloud Storage connector for Hadoop, and Amazon EMR. Some of them even offer Data Lakes-as-a-Service. However, managed Hadoop services do little for creating on-premise data lakes that serve massively scaled data and a variety of analytics needs.
Let’s start by using Hadoop and its architecture as an example. Briefly (very briefly), Hadoop consists of a compute component and a data component. The original compute component was MapReduce; Hadoop 2.0 added YARN, which manages cluster resources and lets MapReduce and other processing engines share them. The data component is the Hadoop Distributed File System (HDFS). Both compute and data components scale out as data grows. Scalability reaches the high PB levels, with the ability to scale to many thousands of nodes.
This massive size also applies to chunks of moving data. HDFS is engineered for large chunks of serial read I/O (64MB blocks by default in early releases, 128MB by default in Hadoop 2.x), which is why Hadoop is friendliest with applications that are compatible with large data sets. Traditional Hadoop batches data and distributes large chunks of data across clusters built on direct-attached storage (DAS) for locality and highest performance. The nodes analyze their chunks independently, then the cluster combines the output to present to human analysts.
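The split, analyze, and combine pattern described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Hadoop's actual API: the names `split_into_chunks`, `analyze_chunk`, and `combine` are hypothetical, and a chunk of four records stands in for a multi-megabyte HDFS block.

```python
from collections import Counter
from functools import reduce

# Assumption: 4 records per chunk stands in for a large HDFS block.
CHUNK_SIZE = 4


def split_into_chunks(records, chunk_size=CHUNK_SIZE):
    """Split a list of records into fixed-size chunks (the last may be shorter)."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def analyze_chunk(chunk):
    """'Map' step: a node counts event types in its own chunk only."""
    return Counter(chunk)


def combine(partials):
    """'Reduce' step: merge the per-chunk counts into one result."""
    return reduce(lambda left, right: left + right, partials, Counter())


# Hypothetical clickstream events, used only for illustration.
events = ["click", "view", "click", "buy", "view", "click", "view", "view", "buy"]
chunks = split_into_chunks(events)
totals = combine(analyze_chunk(chunk) for chunk in chunks)
print(dict(totals))
```

Each chunk is analyzed without reference to any other chunk, which is what lets the work run on separate nodes next to their local disks. Only the small per-chunk summaries have to travel across the network.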
With the introduction of YARN Cluster Resource Management, Hadoop is evolving MapReduce with an eye towards expanding analytics and enterprise-level data services within the Hadoop environment. In addition, Apache offerings including Spark, Hive and Pig, along with Tez acceleration and third-party Hadoop products, now support additional processes including interactive queries, NFS, and streaming data. Additional offerings serve the evolution to the data lake, including compliance controls, security and audit controls, data protection over multiple storage sites, DR and business continuity toolsets, integrated workflows, and centralized management over massive data clusters.
So What’s the Big Data Storage Problem?
But even with this continuing evolution, most Hadoop environments deliberately depend on direct attached storage. Hadoop users are not diving into data lakes pell-mell because Hadoop still runs best using the DAS model, which also spares them the cost of upgrading network speeds.
At this time, introducing massive analytics data over a network results in high latency and bandwidth challenges. Even if a high performance network is capable of supporting upwards of 10,000 nodes and high IO/low latency, the high cost of network equipment is prohibitive. And even on a fast network, huge data traffic may cause network and storage bottlenecks. Acceleration techniques like caching may not work well on large serial IO, and workloads from other applications may not smoothly integrate with Hadoop data traffic.
In smaller Hadoop environments of 100TB or less, companies can externalize Hadoop clusters via the data lake model without significantly impacting network traffic. But scaling up your Hadoop environment to multiple petabytes will net you real bandwidth headaches over most networks. Once big data storage rises above 100 TB – which it is more than likely to do – large data stores will have a negative impact on shared storage capacity and network bandwidth.
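A rough back-of-the-envelope calculation shows why the 100 TB mark matters. The sketch below estimates how long it takes to move a data set across a single network link. The function name `transfer_hours`, the 10 Gbps link speed, and the 70 percent efficiency figure are all illustrative assumptions, not measurements, and terabytes are treated as decimal (10^12 bytes).

```python
def transfer_hours(data_terabytes, link_gigabits_per_sec, efficiency=0.7):
    """Estimate hours needed to move a data set over one network link.

    efficiency is an assumed fraction of the raw link speed that is
    actually usable after protocol overhead and competing traffic.
    """
    total_bits = data_terabytes * 1e12 * 8           # decimal TB -> bits
    usable_bits_per_sec = link_gigabits_per_sec * 1e9 * efficiency
    return total_bits / usable_bits_per_sec / 3600   # seconds -> hours


# Hypothetical sizes: 100 TB and 1 PB (1,000 TB) over an assumed 10 Gbps link.
for size_tb in (100, 1000):
    hours = transfer_hours(size_tb, link_gigabits_per_sec=10)
    print(f"{size_tb} TB over 10 Gbps: about {hours:.0f} hours")
```

Even at a perfect 100 percent efficiency, 100 TB over a 10 Gbps link takes roughly 22 hours, and a petabyte takes ten times that. Real networks shared with other applications will do worse, which is why the advice here is to start small.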
This does not mean that you should not build a data lake even with PB-sized data volumes in Hadoop. There is no law that says this is all or nothing. Start small. Build the data lake for its advantages of enterprise data services, compliance and governance, data protection, and converging data in Hadoop. But only move one distribution or cluster. Move more as you build up your network performance.
What then will it take to effectively build a massively-sized data lake? With the understanding that you do not need to do this all at once, let’s look at some top issues around fully functional big data storage for analytics.
· Issue: This is going to take a while. Apache-developed HDFS is an intensely scalable, high-performance file system for Hadoop. Even in large environments of 700-plus nodes, HDFS can quickly process multiple gigabytes of data using DAS storage. HDFS, however, is specific to Hadoop. It looks like a file system, but its files are immutable. It can append but cannot modify, so it cannot be used as a general file system or a block storage device. Moving some types of application data in and out of Hadoop takes time and resources.
· Issue: Supporting more applications. Some companies only need a traditional Hadoop deployment using DAS. But others want to make Hadoop friendlier to enterprise applications by pointing analyzable data to the data lake storage environment. However, this is not simply a matter of storing structured or unstructured data to HDFS. The application developer might add a Hadoop protocol to their application. Failing that, if the data lake storage presents HDFS to the application layer, then users do not have to wait (possibly forever) for developers to add Hadoop protocols. The data lake’s infrastructure layer will take care of it, allowing enterprise applications to store directly to the data lake for analytics.
· Issue: Over-involving IT. IT will be responsible for deploying and managing their data lakes. However, at present Hadoop offers minimal self-service support for its users, forcing them to rely heavily on IT to copy data to Hadoop big data storage. The end-user of course does not care if they are interacting with the data lake or not. They only want accessibility and flexibility for analyzing business information.
· Issue: How fast is that big data storage? HDFS works very well as the specialized Hadoop file system. However, developments like the Internet of Things generate massive amounts of small workloads that need fast serial writes. When the enterprise chooses externalized storage to build a Hadoop data lake, they will need high-performance storage that efficiently ingests and processes this kind of data as well as larger data sets. High-performance flash is a given, and for the fastest performance IT may want to deploy in-memory storage layers. This level of performance will also enable IT to quickly populate the new data lake with existing data from network shares, data warehouses, and first-generation Hadoop DAS clusters. In addition, Hadoop already offers several capabilities to customize ingestion for incoming data types and sizes. Examples include Apache Sqoop for big data batch loading and integration with legacy databases, and Apache Flume for small workloads like change deltas. Pivotal Big Data Suite offers Spring XD for data streaming at scale, and GemFire XD for identifying duplicates in databases and ensuring write consistency.
· Issue: Networks on steroids. Network performance is crucial to externalizing Hadoop big data storage. Most enterprise networks average about 20-40GB segments between servers. But when they add massive volumes of data moving into and out of a Hadoop data lake, average-sized network segments are quickly overwhelmed. There are networks capable of running extreme data traffic but they are purpose-built and proprietary. An example is Google’s Colossus, a flat cross-sectional network. Bandwidth is generous because data runs over links without bottlenecking at central switches. However, unless you have a colossus of a network budget you are not likely to own such a wonder of technology. So when you start with your first data lake, start small and let network acceleration technologies do the heavy lifting. However, this does mean that the larger your data grows, the more you will want to invest in new networking technologies.
· Issue: What about objects? One of the more interesting possibilities is supporting object storage within the data lake. HDFS represents files and directories as single objects, but this is not the same thing as object-based storage. HDFS is ideal for rich analysis, and object storage for long-term, highly secure storage. Ideally the data lake will be able to offer the advantages of both. The Hortonworks Ozone project, for example, is developing an object store containing HDFS spaces, or buckets. Storiant is developing an HDFS interface for an object store, which lets users run Hadoop analytics in-place on object-based big data storage.
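The first issue in the list above, that HDFS can append but cannot modify, is easier to see in a toy model. The class below is not HDFS and does not use any Hadoop library. `AppendOnlyFile` and `rewrite_with_change` are hypothetical names, written only to show why an in-place update turns into a full copy on append-only storage.

```python
class AppendOnlyFile:
    """Toy model of an append-only file: data can be added, never changed."""

    def __init__(self):
        self._data = bytearray()

    def append(self, payload: bytes) -> int:
        """Add bytes to the end and return the offset where they landed."""
        offset = len(self._data)
        self._data.extend(payload)
        return offset

    def read(self, offset: int = 0, length=None) -> bytes:
        """Read from offset; read to the end when no length is given."""
        end = len(self._data) if length is None else offset + length
        return bytes(self._data[offset:end])

    def write_at(self, offset: int, payload: bytes):
        """In-place updates are refused, mirroring the append-only rule."""
        raise PermissionError("in-place modification is not supported")


def rewrite_with_change(source: AppendOnlyFile, offset: int, payload: bytes) -> AppendOnlyFile:
    """'Modify' a file the only way possible: copy it into a new file."""
    original = source.read()
    patched = original[:offset] + payload + original[offset + len(payload):]
    target = AppendOnlyFile()
    target.append(patched)
    return target


log = AppendOnlyFile()
log.append(b"order=1;")
log.append(b"order=2;")
fixed = rewrite_with_change(log, 6, b"9")  # change one byte via a full copy
print(log.read(), fixed.read())
```

Changing a single byte means reading and rewriting the whole file. That is tolerable for batch analytics but a poor fit for databases and other applications that update records in place, which is part of why moving application data in and out of Hadoop costs time.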
Big Data Storage: What You Can Do Today
The enterprise does not have to wait on data lakes until all of the above challenges are solved. Smaller data lakes and mixed alternatives exist today and can directly benefit analytical big data storage. For example, it is perfectly possible to mix storage resources instead of taking the plunge into purely externalized storage. In this case, HDFS can run on local disk but IT can also point it to HDFS-aware external storage without having to import/export. Isilon scale-out NAS is an example. VMware partnered with Dell EMC and Pivotal to add HDFS support to Isilon. To be sure, Isilon can act as a data lake for PB-sized data given sufficient network bandwidth.
One of the more advanced data lakes is from IBM. They named it IBM BigInsights BigIntegrate, which if you ask me (they didn’t) is quite a mouthful. It is however a very good development for adding enterprise data services to Hadoop. Software executes on the data nodes in the Hadoop cluster to integrate data, manage data and metadata, and provide governance. By adding IBM Streams, users can effectively stream data for real-time analytics.
BlueData is another top entrant for integrating Hadoop with external storage. BlueData provides a software-defined infrastructure and builds Docker containers for virtual Hadoop and Spark clusters.
Protecting and Governing Big Data Storage
Data lakes also offer enterprise data services for data protection, disaster recovery, security, and persistent data that traditional Hadoop deployments do not. According to Mike Matchett of the Taneja Group, data lakes need the following characteristics to work on massively scaled big data storage: centralized index, data subset management, governance and compliance controls, data protection and DR, and agile workflows.
· Centralized index. A centralized index lets IT effectively manage a large data lake. Picture the environment without it: no way to control data sources or versioning, or to filter data by metadata across multiple petabytes of big data storage. With the index, IT can effectively manage even a very large data lake, and end-users can easily locate data for queries, analysis, and reports.
· Manage and secure subsets of data. In order to expose the data lake to applications and end-users, IT must be able to securely grant access and conduct audits. Fortunately, data lake vendors are not leaving IT alone in the dark, and are actively developing for strong subset security and management.
· Govern data and ensure compliance. Data governance and compliance tools are critical for managing analytics big data storage. Governance should include identifying and reporting on stored data, enforcing retention and deletion policies, and tracking compliance especially around personally identifiable information.
· Data protection and DR at massive scales. A traditional Hadoop environment replicates data so that lost nodes will not impact data accessibility or worse, lose data. In a smaller Hadoop deployment this is not necessarily a problem. But with massive scalability, the default of 3x replication generates a huge amount of data copies. Data lakes on externalized storage have the ability to replicate remotely, and to efficiently replicate only delta changes and similar operations.
· Agile analytics and workflows. A data lake will support more than only Hadoop analytics. Apache itself offers Spark, and the OpenStack community has developed several analytics tools for OpenStack. Some storage systems such as Isilon and Qumulo offer native analytics capabilities as well. Rather than limiting your analytics to Hadoop clusters, a data lake can serve as a platform for multiple analytics toolsets.
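Two points from the data protection item above lend themselves to a quick sketch: the raw capacity that 3x replication consumes, and the idea of replicating only changed blocks. The code below is a simplified illustration under stated assumptions. The function names are hypothetical, and comparing SHA-256 digests block by block is just one plausible way to detect deltas, not a description of how any particular storage product does it.

```python
import hashlib


def raw_capacity_tb(logical_tb, replication_factor=3):
    """Raw capacity consumed when every block is stored replication_factor times."""
    return logical_tb * replication_factor


def changed_blocks(old_blocks, new_blocks):
    """Return indices of blocks that are new or whose content differs.

    Assumption: blocks are compared position by position using SHA-256
    digests, so only the listed blocks would need to be shipped to a
    remote copy.
    """
    old_digests = [hashlib.sha256(block).hexdigest() for block in old_blocks]
    changed = []
    for index, block in enumerate(new_blocks):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_digests) or old_digests[index] != digest:
            changed.append(index)
    return changed


# Hypothetical figures: 1 PB (1,000 TB) of logical data at 3x replication.
print(raw_capacity_tb(1000), "TB raw for 1,000 TB of data")

# Hypothetical block contents: one block edited, one block added.
yesterday = [b"block-a", b"block-b", b"block-c"]
today = [b"block-a", b"block-B", b"block-c", b"block-d"]
print("blocks to replicate:", changed_blocks(yesterday, today))
```

At 3x replication, a petabyte of data occupies three petabytes of disk. Shipping only the changed blocks, two of the four in this small example, is what keeps remote replication practical as the data lake grows.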
Your best bet for a successful data lake implementation is to start with a distinct use case. For example, many companies own large data warehouses with frequent ETL operations (Extract, Transform and Load). Data warehouse expenditures and active ETL both impact storage costs and networking performance. This single use case benefits from transitioning from a data warehouse to a data lake. This need not be done in one fell swoop but can happen over time. Once the data lake is in place, IT can offload data warehouse activities to the more efficient analytics big data storage. Additional use cases will increase the data lake’s value.
None of this will happen overnight. Big data storage can be a big headache in many organizations, and there is no single path to providing data lake technology. However, data lakes have the capacity to support massive analytics functions and to converge different data sizes and types. Networking bandwidth will remain an issue for some time to come, but one of the benefits of starting your data lake small is that you can add higher-speed networks in fewer segments. Start with a small data lake today and grow it efficiently over time into a high-performance, massively scalable one.
Even when you start small, eventually it will be a big ticket item. Is it worth it? For growing Hadoop data stores, it is. Without the data lake’s ability to converge data, data must be injected into Hadoop from multiple sources. This adds significant time to the process. Yet analysts and executives expect analysis results quickly. Where a large analytics computation might have taken days, now it needs to take seconds to minutes. A data lake with sufficient convergence and bandwidth can effectively meet these requirements, and will also protect and govern the invaluable data under its control.