There are other benefits to virtualization and separate data nodes. Not only can multiple compute nodes from one cluster access a given data node, but virtualization means that multiple Hadoop clusters can be hosted and configured to access the same data nodes.
In fact, HDFS can be offered as a service itself, managed as a more permanent data repository, while various compute “applications” come and go quite dynamically. In this way, HDFS can serve as a scale-out virtual storage appliance.
Big Data SAN?
One of the cost-compelling reasons to look at a physical Hadoop architecture is to avoid expensive SANs, especially as the data sets grow larger. Yet in the virtual environment it may make sense to consider SAN storage for some big data sets.
One reason is that provisioning compute-only virtual Hadoop clusters is quite simple with VMware’s BDE GUI, but throwing around big data sets is still going to be a challenge. By hosting the data on external shared storage, provisioning virtual Hadoop hosting becomes almost trivial. And hypervisor features like DRS and HA can be fully leveraged. At EMC World 2013, Pat Gelsinger readily demonstrated spinning up and down virtual Hadoop clusters using external Isilon storage.
Another reason to look at SAN storage is data governance. HDFS is not easy to back up, protect, secure, or audit. SANs, of course, are built with strong data protection and snapshots, and RAID uses fewer disks than triplicate replication. It’s easy to imagine some big data applications where the data is critical enough to want to protect and roll back if necessary. And provided the network between hosts and storage is fast enough, SANs can deliver more throughput than server DAS.
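The disk-count comparison between RAID and triplicate replication can be made concrete with some quick arithmetic. The sketch below is illustrative only: it assumes HDFS's default replication factor of 3 and a hypothetical RAID-6 group of 8 data disks plus 2 parity disks, and it ignores hot spares, snapshot reserves, and file system overhead. The function names are made up for this example.

```python
def raw_tb_with_replication(usable_tb: float, replicas: int = 3) -> float:
    """Raw capacity needed when every block is stored `replicas` times."""
    return usable_tb * replicas


def raw_tb_with_parity_raid(usable_tb: float, data_disks: int, parity_disks: int) -> float:
    """Raw capacity needed for a parity RAID group (e.g. RAID-6 as 8 data + 2 parity)."""
    return usable_tb * (data_disks + parity_disks) / data_disks


if __name__ == "__main__":
    # Assumed layout for illustration: RAID-6 with 8 data disks and 2 parity disks.
    for usable in (10, 20):
        replicated = raw_tb_with_replication(usable)
        raid = raw_tb_with_parity_raid(usable, data_disks=8, parity_disks=2)
        print(f"{usable} TB usable: {replicated:.1f} TB raw replicated, {raid:.1f} TB raw on RAID-6")
```

Under these assumptions, a 20TB data set needs 60TB of raw disk with triple replication and 25TB on the RAID-6 layout. Whether that saving offsets the higher per-terabyte cost of SAN storage depends on pricing not modeled here.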
It’s worth mentioning disk failure recovery here too, because with big data on lots of disks, failures become quite common.
In a normal Hadoop cluster, a local disk failure shuts down the affected node, and Hadoop then works around it. In a virtual environment, a disk failure might shut down a data node, but multiple virtual data nodes can be configured per hypervisor server, and a disk failure that sidelines one virtual data node will not take down any other virtual Hadoop nodes on that hypervisor.
With SAN storage, a highly available Hadoop application might never know that disk failures have even happened.
Is Virtualizing Hadoop Crazy?
There are a number of reasons why virtualizing Hadoop makes sense in many usage scenarios. As a virtual workload, Hadoop can achieve comparable performance to physical hosting in a broad set of expected usage scenarios while further helping consolidate and optimize IT infrastructure investments.
At this point, thousand-node clusters with multiple PBs of data in continuous use aren’t likely virtualization candidates. But we think that most organizations have some big data opportunities in the 10-20TB range, and they could be extracting value from that data if only their IT shops could offer scale-out analytical solutions as a cost-effective service.
With a virtual Hadoop capability, a single big data set can be readily shared “in-place” between multiple virtualized Hadoop clusters. That creates an opportunity to serve multiple clients with the same storage. By eliminating multiple copies of big data sets, reducing the amount of data migration, and ensuring higher availability and data protection, Hadoop becomes more manageable and readily supported as an enterprise production application.
In fact, over a wide range of expected enterprise Hadoop uses, the TCO of hosting virtualized Hadoop on fewer but relatively more expensive virtual servers, with potentially expensive storage options, can still be lower than standing up a dedicated physical cluster of commodity servers. And the open source crowd can start to look towards the competing “Project Savanna” for similar capabilities coming on OpenStack/KVM.
Factoring in the sharing and consolidation of nodes, ease of administration, elastic provisioning, agile servicing, shared data services, and higher availability can lead to favorable cost comparisons. But we think the ability to create a full Hadoop cluster on demand, effectively “thin provisioned,” is seductive enough for many organizations to try out the vSphere Big Data Extensions on their existing vSphere platforms with little risk. And we believe that will lead to significant adoption.