The Top of the Big Data Stack: Hadoop, MapReduce, and Analytics
Back in May, Henry kicked off a collaborative effort to examine some of the details behind the Big Data push and what they really mean. This article will continue our high-level examination of Big Data from the stop of the stack -- that is, the applications. Specifically, we will discuss the role of Hadoop and Analytics and how they can impact storage (hint, it's not trivial).
Henry and I have undertaken the task of examining Big Data and what it really means. It's a buzzword. Like many buzzwords, it has been beaten to death yet contains a kernel of real usefulness and technology. We've decided to tackle Big Data by doing some ground and pound on the subject, and find the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition for Big Data I've seen. Hence, I will repeat it, yet again:
|Big Data is the process of changing data into information, which then changes into knowledge.|
This definition is so appropriate because the adjective "Big" can mean many things to many fields of interest. Rather than focus on what some people think of as "Big" for their particular field, we can instead focus on what you do with the data and why.
Henry and I chose to tackle the discussion by coming from two different directions. Henry is starting at the very bottom of the stack with the hardware itself and then moving up through the stack. More precisely, what aspects of hardware are important for Big Data and what technologies are important? I'm starting at the top of the Big Data stack with the applications and then moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
Starting at the top isn't easy, and my original article became rather lengthy. So we collectively decided to cut it into three parts. The first part started off discussing some fundamental issues at the top, including the importance of getting data into a storage system for use (this is going to be more important than most people realize). It also discusses the most common tools for Big Data -- NoSQL databases. The second article examined eight NoSQL database classes used in Big Data that impact storage. This final look at the top of the stack will discuss the role of Hadoop in Big Data and how this all ties into analytical tools such as R.
Connection With Hadoop
All of databases mentioned in the previous article need a place to store their data while also recognizing that performance is a key feature of all of the them. A number of the tools mentioned have connections for using Hadoop as the storage platform. Hadoop is really not a file system. Rather, it is a software framework that supports data-intensive distributed applications, such as the ones discussed here and in the previous parts. When coupled with MapReduce, Hadoop can be a very effective solution for data-intensive applications.
The Hadoop File System (HDFS) is an open-source file system derived from Google's file system, aptly named Google File System (GFS). However, GFS is proprietary to Google. Hadoop is written in Java, and it is a distributed file system that is really a meta-file system -- in other words, a file system that sits on top of a lower-level file system. It is designed to be fault-tolerant, allowing copies of the data to be stored in various locations within the file system, so recovering from a corrupt copy of the data or a down server is rather easy. But these copies can also be used to improve performance (more on that later).
The fundamental building block for Hadoop is what is called a "datanode." This is a combination of a server with some storage and networking. The storage is usually either inside the server or directly attached to the server (DAS). Each datanode serves up the data over the network (Ethernet) using a block protocol unique to HDFS. A number of datanodes are distributed across multiple racks, and each datanode is partially identified by which rack it is in. There is also a metadata server, called the "Namenode," that also serves as the management node for HDFS. In addition, there is a secondary Namenode for HDFS, but it's not a fail-over metadata server. Rather, it is used for other file system tasks, such as snapshotting the directory information from the primary Namenode to help reduce downtime in the event of a Namenode failure. Regardless, because there is only one Namenode, it can become a potential bottleneck or single point of failure for HDFS.
One of the key features of HDFS is that the data is replicated across multiple datanodes to help with resiliency. By default, HDFS stores three copies of the data on different datanodes. Two copies are within the same rack, and one is on a different rack (i.e., you can lose a rack and still access your data). Jobs will be scheduled to run on a specific datanode that has the required data, so by default it is one of the three datanodes that has a copy of the data (Note: It may be subtle, but the datanodes store data and provide data access while alsorunning jobs).
This is the concept that many people refer to as "moving the job to the data, rather than moving the data to the job." This reduces data movement and also reduces the load on the network because data isn't being moved to run a job. Once the job starts running, all data access is done locally, so there is no striping across datanodes or using multiple data servers for parallel data access. However, large files are stored across multiple machines. The parallelism in Hadoop that can result in performance is achieved at the application level where several copies of the same application may be run, accessing different data sets. Moreover, improved performance can be achieved because you have three copies of the data, so you could potentially run three jobs that require access to the same file at the same time.
In the background, the datanodes will communicate with each other using RPC to perform a number of tasks:
- Capacity balance across datanodes while obeying the data replication rules
- Compare the files to each other so a corrupted copy of a file is overwritten with a correct copy
- Check the number of data copies and make additional copies if necessary
Note that HDFS is not a POSIX compliant file system primarily because the performance can be improved.
Accessing data in HDFS is fairly simple if you use the native Java API, the Thrift API, the command-line interface, or browse through the HDFS-UI webpage over HTTP. Beyond that, directly mounting HDFS on an operating system is not possible. The only solution at the time of this article is to use a Linux FUSE client that allows you to mount the file system.
Remember, Hadoop is based on the Google File System (GFS), which was developed to support Google's BigTable. Remember also that BigTable is a column-oriented database. Hence, it is more likely to support the Column Store tools mentioned in the previous article. But many of the tools previously mentioned have developed interfaces to and from Hadoop, so they can use it for storing data. The list below is not exhaustive, but it gives you an idea of which tools integrate with Hadoop.
- Wide Column Store/Column Families