Virtualizing Hadoop Impacts Big Data Storage
by Mike Matchett, Sr. Analyst, Taneja Group
Hadoop is coming to enterprise IT in a big way. VMware’s new vSphere Big Data Extensions (BDE) commercializes its open source Project Serengeti to make it dead easy for enterprise admins to spin virtual Hadoop clusters up and down at will.
Now that VMware has made it clear that Hadoop is going to be fully supported as a virtualized workload in enterprise vSphere environments, here at Taneja Group we expect a rapid pickup in Hadoop adoption across organizations of all sizes.
Cluster deployment and operation are becoming very easy for the virtual admin. However, Hadoop is all about mapping parallel compute jobs intelligently over massive amounts of distributed data, and in a virtual environment where storage can be effectively abstracted from compute clients, there are some important complexities and opportunities to consider when designing the underlying storage architecture. Specific concerns with running Hadoop in a virtual environment include how to configure virtual data nodes, how to best utilize local hypervisor server DAS, and when to think about leveraging external SAN/NAS.
Big Data, Virtually
The main idea behind virtualizing Hadoop is to deploy Hadoop scale-out nodes as virtual machines instead of as racked commodity physical servers. Clusters can be provisioned on demand and elastically expanded or shrunk. Multiple Hadoop virtual nodes can be hosted on each physical hypervisor server and, as virtual machines, can easily be allocated more or fewer resources for a given application. Hypervisor-level HA/FT capabilities can be brought to bear on production Hadoop apps. VMware’s BDE even includes QoS algorithms that help prioritize clusters dynamically, shrinking lower-priority clusters as necessary to ensure high-priority cluster service.
One of the big concerns with virtualizing Hadoop is performance. Much of Hadoop’s value lies in how effectively it executes parallel algorithms over distributed data chunks. Hadoop takes advantage of high data “locality” by spreading big data out over many nodes using HDFS (Hadoop Distributed File System). It then farms out parallelized compute tasks local to each data node for initial processing (the “map” part of MapReduce, implemented by job and task trackers).
The design, with each scale-out physical node hosting both local compute and a share of data, is intended to support applications like searching and scoring. These applications might often need to crawl through all the data of massively large data sets, which are commonly made up of low-level semi- or unstructured text and documents.
Commonly, each HDFS data node will be assigned raw physical host server DAS disks directly by the hypervisor. HDFS will then replicate data across data nodes, by default keeping three copies of each block (the original plus two replicas) on different data nodes. On a physical cluster, replicas land on different servers by definition (one data node per server). HDFS also knows to place the second copy on a different “rack” of nodes to help avoid rack-level loss.
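The rack-aware placement described above can be sketched in a few lines of Python. This is a simplified illustration, not HDFS code: it assumes the stock replication factor of three and the commonly documented default policy (first copy on the writer's node, second on a different rack, third on another node in that second rack). The function name, the topology dictionary, and the assumption of at least two racks with two nodes each are all hypothetical.

```python
import random


def place_replicas(writer, rack_of, rng=random):
    """Pick three data nodes for one block; rack_of maps node name -> rack name.

    Illustrative only: assumes at least two racks with two nodes each.
    """
    # First copy: the writer's own node, so the writing task reads locally.
    first = writer
    # Second copy: any node on a different rack, to survive a rack failure.
    off_rack = [n for n, r in rack_of.items() if r != rack_of[writer]]
    second = rng.choice(off_rack)
    # Third copy: a different node on the same rack as the second copy.
    same_rack = [n for n, r in rack_of.items()
                 if r == rack_of[second] and n != second]
    third = rng.choice(same_rack)
    return [first, second, third]


racks = {"n1": "rack1", "n2": "rack1", "n3": "rack2", "n4": "rack2"}
print(place_replicas("n1", racks))
```

On a physical cluster the "different node" checks guarantee different servers. Once several data nodes share one hypervisor host, that guarantee no longer holds, which is the gap the next section addresses.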
In the virtual world, Hadoop must become aware of the hypervisor grouping of virtual nodes in order to ensure good physical data placement and subsequent job/task assignment. This virtual awareness is implemented by the Hadoop Virtual Extensions (HVE) that VMware contributed to Apache Hadoop 1.2.
Hadoop Virtual Extensions
The Hadoop Virtual Extensions do break the virtual abstraction between application and physical hosting. But in some ways, the Hadoop platform can be seen as another layer of virtualization, adding scale-out data and compute management on top of the hypervisor.
HVE essentially inserts a new “node group” level into the Hadoop hierarchy between nodes and racks. A node group represents the set of virtual Hadoop nodes on a given hypervisor server, which helps inform Hadoop and HDFS management algorithms.
The effect is that Hadoop can maintain knowledge of “data locality” even in the virtual environment, keeping compute tasks close to the data they require for performance and ensuring optimal placement of replicas for fault tolerance.
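To make the extra node group layer concrete, here is a minimal Python sketch of a four-level topology (rack, node group, virtual node). It is an illustration under stated assumptions rather than the HVE implementation: the path format, host and VM names, and function names are hypothetical. The distance function counts hops to the closest common ancestor, which is how Hadoop's topology-based locality is usually described, and the placement helper shows only the "no two copies on one physical host" rule, leaving rack rules aside.

```python
def distance(a, b):
    """Hops from each node up to the closest common ancestor in the topology tree."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def node_group(path):
    """The /rack/nodegroup prefix, i.e. the hypervisor server hosting the VM."""
    return path.rsplit("/", 1)[0]


def pick_replica_targets(candidates, copies=3):
    """Greedily pick virtual data nodes so that no two share a node group."""
    used, targets = set(), []
    for vm in candidates:
        if node_group(vm) not in used:
            targets.append(vm)
            used.add(node_group(vm))
        if len(targets) == copies:
            break
    return targets


vm_a = "/rack1/host1/vm-a"
print(distance(vm_a, "/rack1/host1/vm-a"))  # 0: same virtual node
print(distance(vm_a, "/rack1/host1/vm-b"))  # 2: same node group (same hypervisor server)
print(distance(vm_a, "/rack1/host2/vm-c"))  # 4: same rack, different server
print(distance(vm_a, "/rack2/host3/vm-d"))  # 6: different rack
```

Without the node group level, vm-a and vm-b would look like two independent servers on the same rack, and a scheduler could place two copies of a block on disks that fail together.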
Data Node Options
When you virtualize Hadoop nodes, you also have the option to separate the compute side (task trackers, et al.) from the data node and place each in a different virtual machine. If the compute node and data node virtual machines still reside on the same hypervisor server, they can effectively communicate over a virtual network “in-memory” and won’t suffer any significant physical network latencies. HVE can ensure that this data-local relationship is maintained for performance.
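The scheduling preference this implies can be sketched as follows. This is a hypothetical illustration in Python, not the logic HVE or the job tracker actually runs: the function name, VM names, and the host mapping are all assumptions. It prefers an idle compute VM that shares a hypervisor server with a data node VM holding the block, and falls back to any idle compute VM otherwise.

```python
def choose_compute_vm(block_data_vms, idle_compute_vms, host_of):
    """Return (compute_vm, is_host_local) for a task that reads one block."""
    data_hosts = {host_of[vm] for vm in block_data_vms}
    # Prefer a compute VM that shares a hypervisor server with a copy of the block.
    for vm in idle_compute_vms:
        if host_of[vm] in data_hosts:
            return vm, True
    # Otherwise fall back to any idle compute VM; the read crosses the physical network.
    if idle_compute_vms:
        return idle_compute_vms[0], False
    return None, False


host_of = {"data-1": "host1", "data-2": "host2",
           "compute-1": "host3", "compute-2": "host2"}
print(choose_compute_vm(["data-1", "data-2"], ["compute-1", "compute-2"], host_of))
# ('compute-2', True): compute-2 sits on host2 alongside data-2
```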
Separating out the data node from compute nodes gives you orthogonal scaling and the option to host multiple compute nodes sharing a single data node. This new flexibility enables optimizing the utilization of the host physical server resources, although getting the ratios right for each application might require a lot of experimentation.