by Mike Matchett, Sr. Analyst, Taneja Group
Hadoop is coming to enterprise IT soon, and in a big way. VMware’s new vSphere Big Data Extensions (BDE) commercializes its open source Project Serengeti to make it dead easy for enterprise admins to spin up and down virtual Hadoop clusters at will.
Now that VMware has made it clear that Hadoop is going to be fully supported as a virtualized workload in enterprise vSphere environments, here at Taneja Group we expect a rapid pickup in Hadoop adoption across organizations of all sizes.
However, Hadoop is all about mapping parallel compute jobs intelligently over massive amounts of distributed data. Cluster deployment and operation are becoming very easy for the virtual admin, but in a virtual environment where storage can be effectively abstracted from compute clients, there are important complexities and opportunities to consider when designing the underlying storage architecture. Specific concerns with running Hadoop in a virtual environment include how to configure virtual data nodes, how best to utilize local hypervisor server DAS, and when to consider leveraging external SAN/NAS.
Big Data, Virtually
The main idea behind virtualizing Hadoop is to deploy Hadoop scale-out nodes as virtual machines instead of as racked commodity physical servers. Clusters can be provisioned on demand and elastically expanded or shrunk. Multiple Hadoop virtual nodes can be hosted on each hypervisor physical server, and as virtual machines they can easily be allocated more or fewer resources for a given application. Hypervisor-level HA/FT capabilities can be brought to bear on production Hadoop apps. VMware’s BDE even includes QoS algorithms that help prioritize clusters dynamically, shrinking lower-priority clusters as necessary to ensure high-priority cluster service levels.
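To make that prioritization idea concrete, here is a minimal Python sketch of priority-based slot allocation that shrinks a lower-priority cluster when a higher-priority one needs capacity. It is purely illustrative: the cluster names, numbers, and the simple greedy policy are our own assumptions, not VMware’s actual BDE elasticity algorithm.

# Toy illustration (not VMware's actual BDE algorithm): allocate a fixed pool of
# task-VM slots across clusters by priority, shrinking lower-priority clusters
# when higher-priority clusters need capacity.
def allocate_slots(clusters, total_slots):
    """clusters: list of dicts with 'name', 'priority' (higher wins), 'demand', 'minimum'."""
    # Reserve each cluster's minimum first so nothing is starved completely.
    allocation = {c["name"]: c["minimum"] for c in clusters}
    remaining = total_slots - sum(allocation.values())
    # Hand out the rest in priority order, up to each cluster's demand.
    for c in sorted(clusters, key=lambda c: c["priority"], reverse=True):
        extra = min(c["demand"] - allocation[c["name"]], remaining)
        if extra > 0:
            allocation[c["name"]] += extra
            remaining -= extra
    return allocation

if __name__ == "__main__":
    clusters = [
        {"name": "prod-scoring", "priority": 10, "demand": 40, "minimum": 8},
        {"name": "adhoc-analytics", "priority": 1, "demand": 40, "minimum": 4},
    ]
    print(allocate_slots(clusters, total_slots=48))
    # {'prod-scoring': 40, 'adhoc-analytics': 8} -- the low-priority cluster shrinks.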
Obviously, one of the big concerns with virtualizing Hadoop is performance. Much of Hadoop’s value lies in how effectively it executes parallel algorithms over distributed chunks of data. Hadoop takes advantage of high data “locality” by spreading big data over many nodes using HDFS (the Hadoop Distributed File System). It then farms out parallelized compute tasks to run locally on each data node for initial processing (the “map” part of MapReduce, implemented by the job and task trackers).
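For readers who want to see that locality preference in concrete terms, here is a minimal Python sketch of the decision a MapReduce scheduler aims to make when a node has a free map slot. It is a conceptual illustration only, not actual JobTracker code; the node and task names are made up.

# Illustrative sketch only -- not Hadoop's actual JobTracker code. It shows the
# locality preference MapReduce schedulers aim for: run a map task on the node
# that holds the data, else on the same rack, else anywhere.
def pick_task(free_node, free_node_rack, pending_tasks, block_locations, rack_of):
    """pending_tasks: task ids; block_locations: task -> list of nodes holding its split."""
    node_local, rack_local = [], []
    for task in pending_tasks:
        nodes = block_locations[task]
        if free_node in nodes:
            node_local.append(task)
        elif any(rack_of[n] == free_node_rack for n in nodes):
            rack_local.append(task)
    if node_local:
        return node_local[0], "node-local"
    if rack_local:
        return rack_local[0], "rack-local"
    return (pending_tasks[0], "off-rack") if pending_tasks else (None, None)

# Example: task "t2"'s data lives on the free node, so it gets scheduled there.
rack_of = {"n1": "r1", "n2": "r1", "n3": "r2"}
locations = {"t1": ["n3"], "t2": ["n1", "n3"]}
print(pick_task("n1", "r1", ["t1", "t2"], locations, rack_of))  # ('t2', 'node-local')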
This design, with each scale-out physical node hosting both local compute and a share of the data, is intended to support applications like searching and scoring that often need to crawl through entire, massively large data sets commonly made up of low-level semi-structured or unstructured text and documents.
Commonly, each HDFS data node will be assigned raw physical host server DAS disks directly by the hypervisor. HDFS then replicates data across data nodes, by default keeping three copies of each block: one written locally and two replicas on other data nodes. On a physical cluster, replicas land on different servers by definition (one data node per server), and HDFS also knows to place the second replica on a different “rack” of nodes to help survive rack-level loss.
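A simplified Python sketch of that default placement idea follows, assuming one data node per physical server and a small two-rack cluster. The node names are hypothetical, and real HDFS applies many more constraints; this is just to show the shape of the policy.

# Simplified sketch of the default HDFS placement idea (three copies total):
# first replica on the writing node, second on a node in a different rack,
# third on another node in that same remote rack. Not the real HDFS code.
import random

def place_replicas(writer, nodes_by_rack):
    """nodes_by_rack: rack -> list of data nodes; writer: (rack, node) of the client."""
    writer_rack, writer_node = writer
    replicas = [writer_node]                                  # copy 1: local node
    remote_racks = [r for r in nodes_by_rack if r != writer_rack]
    remote_rack = random.choice(remote_racks)
    second = random.choice(nodes_by_rack[remote_rack])        # copy 2: different rack
    replicas.append(second)
    third_candidates = [n for n in nodes_by_rack[remote_rack] if n != second]
    replicas.append(random.choice(third_candidates))          # copy 3: same remote rack
    return replicas

print(place_replicas(("rack1", "dn1"),
                     {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4", "dn5"]}))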
In the virtual world, Hadoop must become aware of how virtual nodes are grouped onto hypervisors in order to ensure good physical data placement and subsequent job/task assignment. This virtual awareness is implemented by the Hadoop Virtualization Extensions (HVE) that VMware contributed to Apache Hadoop 1.2.
Hadoop Virtualization Extensions
The Hadoop Virtualization Extensions do break the virtual abstraction between the application and its physical hosting. But in some ways the Hadoop platform can be seen as another layer of virtualization, adding scale-out data and compute management on top of the hypervisor.
HVE essentially inserts a new “node group” level into the Hadoop topology hierarchy, between nodes and racks. A node group represents the set of virtual Hadoop nodes on a given hypervisor server, and it helps inform Hadoop and HDFS management algorithms.
The effect is that Hadoop can maintain its knowledge of data locality even in the virtual environment, keeping compute tasks close to the data they need for performance and ensuring optimal placement of replicas for fault tolerance.
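Conceptually, HVE treats each virtual node’s location as a path of rack, node group (hypervisor host), and node, using it both to measure “closeness” and to keep replicas off the same physical host. The Python sketch below illustrates that idea with made-up topology paths; it is our own simplification, not the actual Apache Hadoop HVE code.

# Conceptual sketch of HVE-style topology awareness (not the actual Apache Hadoop
# HVE code). Each VM gets a path /rack/nodegroup/node, where the node group
# identifies its hypervisor host; placement avoids putting two replicas in the
# same node group, and schedulers prefer the "closest" data.
def topology(path):
    return path.strip("/").split("/")          # e.g. "/r1/hv1/vm1" -> ["r1","hv1","vm1"]

def shared_levels(a, b):
    """How many topology levels two virtual nodes share (rack, then node group)."""
    count = 0
    for x, y in zip(topology(a)[:-1], topology(b)[:-1]):
        if x != y:
            break
        count += 1
    return count

def ok_to_place(candidate, existing_replicas):
    """Reject a candidate VM whose hypervisor (node group) already holds a replica."""
    return all(shared_levels(candidate, r) < 2 for r in existing_replicas)

print(ok_to_place("/r1/hv1/vm2", ["/r1/hv1/vm1"]))   # False: same hypervisor host
print(ok_to_place("/r1/hv2/vm3", ["/r1/hv1/vm1"]))   # True: same rack, different host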
Data Node Options
When you virtualize Hadoop nodes, you also have the option to separate the compute side (task trackers et al.) from the data node and place each in a different virtual machine. If the compute node and data node virtual machines still reside on the same hypervisor server, they can effectively communicate over the virtual network “in memory” and won’t suffer any significant physical network latency. HVE can ensure that this data-local relationship is maintained for performance.
Separating the data node from the compute nodes gives you orthogonal scaling and the option to have multiple compute nodes share a single data node. This new flexibility enables optimizing the utilization of the physical host server’s resources, although getting the ratios right for each application might require a lot of experimentation.
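As a rough illustration of what a data/compute-separated cluster definition might look like, here is a Python sketch loosely modeled on the kind of node-group spec BDE/Serengeti works with. The field names and numbers below are our own assumptions for illustration, not the exact product schema; the point is the shape of the layout and the tunable compute-to-data ratio.

# Illustrative cluster layout only -- the field names are assumptions for this
# sketch, not the exact Serengeti/BDE spec schema. The shape is what matters:
# a small master group, a stable data-node group, and a compute-node group you
# can scale (here 3 task-tracker VMs per data-node VM) without touching the data.
cluster_spec = {
    "name": "marketing-analytics",
    "node_groups": [
        {"name": "master",  "roles": ["namenode", "jobtracker"], "instances": 1},
        {"name": "data",    "roles": ["datanode"],    "instances": 4,
         "storage": "local_das"},            # data nodes pinned to host DAS
        {"name": "compute", "roles": ["tasktracker"], "instances": 12,
         "placement": "colocate_with:data"}, # keep compute VMs on the same hosts
    ],
}

compute = next(g for g in cluster_spec["node_groups"] if g["name"] == "compute")
data = next(g for g in cluster_spec["node_groups"] if g["name"] == "data")
print(compute["instances"] / data["instances"])   # 3.0 -- a ratio to tune per workload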
There are other benefits to virtualization and separate data nodes. Not only can multiple compute nodes from one cluster access a given data node, but virtualization also means that multiple Hadoop clusters can be hosted and configured to access the same data nodes.
In fact, HDFS can be offered as a service in itself, managed as a more permanent data repository, while various compute “applications” come and go quite dynamically. In this way, HDFS can now serve as a scale-out virtual storage appliance.
Big Data SAN?
One of the compelling cost reasons to look at a physical Hadoop architecture is avoiding expensive SANs, especially as data sets grow larger. Yet in the virtual environment it may make sense to consider SAN storage for some big data sets.
One reason is that while provisioning compute-only virtual Hadoop clusters is quite simple with VMware’s BDE GUI, throwing around big data sets is still a challenge. By hosting the data on external shared storage, provisioning virtual Hadoop clusters becomes almost trivial, and hypervisor features like DRS and HA can be fully leveraged. At EMC World 2013, Pat Gelsinger readily demonstrated spinning up and down virtual Hadoop clusters using external Isilon storage.
Another reason to look at SAN storage is data governance. HDFS is not easy to back up, protect, secure or audit. SANs, of course, are built with strong data protection, including RAID (which uses fewer disks than HDFS’s triple replication) and snapshots. It’s easy to imagine big data applications where the data is critical enough that you want to protect it and roll back if necessary. And given sufficiently high-performance networking, a SAN can even deliver more throughput than server DAS.
It’s worth mentioning disk failure recovery here too, because with big data on lots of disks, failures become quite common.
In a normal Hadoop cluster, a local disk failure shuts down that node, and Hadoop then works around it. In a virtual environment, a disk failure might shut down the data node, but multiple virtual data nodes can be configured per hypervisor server. And a disk failure that sidelines a virtual data node will not take down any other virtual Hadoop nodes on that hypervisor.
With SAN storage, a highly available Hadoop application might never know that disk failures have even happened.
Is Virtualizing Hadoop Crazy?
There are a number of reasons why virtualizing Hadoop makes sense. As a virtual workload, Hadoop can achieve performance comparable to physical hosting across a broad range of expected use cases, while further helping consolidate and optimize IT infrastructure investments.
At this point, thousand-node clusters with multiple PBs of data in continuous use aren’t likely virtualization candidates. But we think most organizations have big data opportunities in the 10-20TB range, and they could be extracting value from that data if only their IT shops could offer scale-out analytical solutions as a cost-effective service.
With a virtual Hadoop capability, a single big data set can be readily shared “in-place” between multiple virtualized Hadoop clusters. That creates an opportunity to serve multiple clients with the same storage. By eliminating multiple copies of big data sets, reducing the amount of data migration, and ensuring higher availability and data protection, Hadoop becomes more manageable and readily supported as an enterprise production application.
In fact, over a wide range of expected enterprise Hadoop usages, the TCO of hosting virtualized Hadoop on fewer (but relatively more expensive) virtualization host servers, even with pricier storage options, can still be lower than standing up a dedicated physical cluster of commodity servers. And the open source crowd can start to look toward the competing “Project Savanna” for similar capabilities coming on OpenStack/KVM.
Factoring in the sharing and consolidation of nodes, ease of administration, elastic provisioning, agile servicing, shared data services, and higher availability can lead to favorable cost comparisons. But we think the ability to create a full Hadoop cluster on demand, effectively “thin provisioned,” is seductive enough that many organizations will try out the vSphere Big Data Extensions on their existing vSphere platforms with little risk. And we believe that will lead to significant adoption.