Back in May, Henry kicked off a collaborative effort to examine some of the details behind the Big Data push and what they really mean. This article will continue our high-level examination of Big Data from the stop of the stack -- that is, the applications. Specifically, we will discuss the role of Hadoop and Analytics and how they can impact storage (hint, it's not trivial).
Henry and I have undertaken the task of examining Big Data and what it really means. It's a buzzword. Like many buzzwords, it has been beaten to death yet contains a kernel of real usefulness and technology. We've decided to tackle Big Data by doing some ground and pound on the subject, and find the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition for Big Data I've seen. Hence, I will repeat it, yet again:
|Big Data is the process of changing data into information, which then changes into knowledge.|
This definition is so appropriate because the adjective "Big" can mean many things to many fields of interest. Rather than focus on what some people think of as "Big" for their particular field, we can instead focus on what you do with the data and why.
Henry and I chose to tackle the discussion by coming from two different directions. Henry is starting at the very bottom of the stack with the hardware itself and then moving up through the stack. More precisely, what aspects of hardware are important for Big Data and what technologies are important? I'm starting at the top of the Big Data stack with the applications and then moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
Starting at the top isn't easy, and my original article became rather lengthy. So we collectively decided to cut it into three parts. The first part started off discussing some fundamental issues at the top, including the importance of getting data into a storage system for use (this is going to be more important than most people realize). It also discusses the most common tools for Big Data -- NoSQL databases. The second article examined eight NoSQL database classes used in Big Data that impact storage. This final look at the top of the stack will discuss the role of Hadoop in Big Data and how this all ties into analytical tools such as R.
Connection With Hadoop
All of databases mentioned in the previous article need a place to store their data while also recognizing that performance is a key feature of all of the them. A number of the tools mentioned have connections for using Hadoop as the storage platform. Hadoop is really not a file system. Rather, it is a software framework that supports data-intensive distributed applications, such as the ones discussed here and in the previous parts. When coupled with MapReduce, Hadoop can be a very effective solution for data-intensive applications.
The Hadoop File System (HDFS) is an open-source file system derived from Google's file system, aptly named Google File System (GFS). However, GFS is proprietary to Google. Hadoop is written in Java, and it is a distributed file system that is really a meta-file system -- in other words, a file system that sits on top of a lower-level file system. It is designed to be fault-tolerant, allowing copies of the data to be stored in various locations within the file system, so recovering from a corrupt copy of the data or a down server is rather easy. But these copies can also be used to improve performance (more on that later).
The fundamental building block for Hadoop is what is called a "datanode." This is a combination of a server with some storage and networking. The storage is usually either inside the server or directly attached to the server (DAS). Each datanode serves up the data over the network (Ethernet) using a block protocol unique to HDFS. A number of datanodes are distributed across multiple racks, and each datanode is partially identified by which rack it is in. There is also a metadata server, called the "Namenode," that also serves as the management node for HDFS. In addition, there is a secondary Namenode for HDFS, but it's not a fail-over metadata server. Rather, it is used for other file system tasks, such as snapshotting the directory information from the primary Namenode to help reduce downtime in the event of a Namenode failure. Regardless, because there is only one Namenode, it can become a potential bottleneck or single point of failure for HDFS.
One of the key features of HDFS is that the data is replicated across multiple datanodes to help with resiliency. By default, HDFS stores three copies of the data on different datanodes. Two copies are within the same rack, and one is on a different rack (i.e., you can lose a rack and still access your data). Jobs will be scheduled to run on a specific datanode that has the required data, so by default it is one of the three datanodes that has a copy of the data (Note: It may be subtle, but the datanodes store data and provide data access while also running jobs).
This is the concept that many people refer to as "moving the job to the data, rather than moving the data to the job." This reduces data movement and also reduces the load on the network because data isn't being moved to run a job. Once the job starts running, all data access is done locally, so there is no striping across datanodes or using multiple data servers for parallel data access. However, large files are stored across multiple machines. The parallelism in Hadoop that can result in performance is achieved at the application level where several copies of the same application may be run, accessing different data sets. Moreover, improved performance can be achieved because you have three copies of the data, so you could potentially run three jobs that require access to the same file at the same time.
In the background, the datanodes will communicate with each other using RPC to perform a number of tasks:
- Capacity balance across datanodes while obeying the data replication rules
- Compare the files to each other so a corrupted copy of a file is overwritten with a correct copy
- Check the number of data copies and make additional copies if necessary
Note that HDFS is not a POSIX compliant file system primarily because the performance can be improved.
Accessing data in HDFS is fairly simple if you use the native Java API, the Thrift API, the command-line interface, or browse through the HDFS-UI webpage over HTTP. Beyond that, directly mounting HDFS on an operating system is not possible. The only solution at the time of this article is to use a Linux FUSE client that allows you to mount the file system.
Remember, Hadoop is based on the Google File System (GFS), which was developed to support Google's BigTable. Remember also that BigTable is a column-oriented database. Hence, it is more likely to support the Column Store tools mentioned in the previous article. But many of the tools previously mentioned have developed interfaces to and from Hadoop, so they can use it for storing data. The list below is not exhaustive, but it gives you an idea of which tools integrate with Hadoop.
- Wide Column Store/Column Families
Any of the tools that are wide column or even key-value stores, particularly HBase and Hypertable, can integrate with Hadoop as you can see in the list. However, you can also see that a number of tools do not use Hadoop and instead rely on other storage methods.
If you don't want to use one of the NoSQL databases and mess with the Hadoop integration, there are two tools that the Hadoop community has designed to work with Hadoop to give it some search capability. The first, called Hive, is a data warehouse package that also has some querying features. More precisely, it can perform data summarization and some ad-hoc queries along with larger scale analysis of data sets stored in Hadoop. The queries are handled with an SQL-like language called HiveQL that allows you to perform basic searches on the data as well as allow you to plug in your own mapping code and reduction code into the code (see subsequent discussion about MapReduce).
The second tool is Pig. Pig goes beyond just querying the data and adds analysis capabilities (see subsequent section on analytics using R). Like Hadoop, Pig was designed for parallelism, which makes it a good fit for Hadoop. Pig has a high-level language called Pig Latin, which allows you to write data analysis code for accessing data that resides in Hadoop. This language is compiled to produce a series of MapReduce programs that run on Hadoop (see subsequent section on MapReduce).
Hadoop is one of the technologies people are exploring for enabling Big Data. If you use Google to search on Hadoop architectures, you will find a number of links, but generally the breadth of applications and data in Big Data is so large that it is impossible to develop a general Hadoop storage architecture. With that said, here are some general rules of thumb or guidelines in architecting a Hadoop storage solution:
- You need enough disks in a datanode to satisfy the application IO demands. This means that the number of disks can vary by quite a bit, but you must understand how the application is accessing the data (IO pattern) and the related IO requirements to properly chose the number of disks. Is the data access more streaming oriented or more IOPS oriented? Are there more reads than writes? How much does IO influence the run time?
- A second rule of thumb is to define how much parallelism you think you might need. This information can tell you how many datanodes you need, which also tells you how many datanodes may be accessing the same data file. So if you think you can get lots of parallelism in your application, then you will need a fair number of datanodes, and you may have to increase the number of data copies from three to a larger number.
- If you don't have enough datanodes, your network traffic will increase rather alarmingly. This is because Hadoop may need to copy data from one datanode to where it is needed (another datanode). Hadoop may also need to do some housekeeping in the background, such as deleting too many copies of the data or updating the copies, which also puts more pressure on the network.
- As a general rule of thumb, put enough capacity in each datanode to hold the largest file. Remember, Hadoop doesn't do striping so the entire file is located on the datanode. Hence, it must reside on the node in its entirety. Don't skimp on disks.
These are really high-level tips, but hopefully they get you started in the right direction to get the answers you need for architecting a Hadoop system.
At the beginning of this article I used Henry's definition for Big Data, which means taking data and turning it into information (as a first step). Given that Big Data usually implies lots of data (but not always), this means that there could be lots of processing. You are not likely to run one application against a single set of data producing some end result. Instead, you are likely to run various different analyses against multiple data sets with various parameters and collect information (results) from each and store those in a database for further processing. This implies a large number of runs over different, potentially large, data sets, resulting in lots of results. How do you coordinate and configure all of these runs?
One way to do this is to use something called MapReduce. In general terms, MapReduce is a framework for embarrassingly parallel computations that use potentially large data sets and a large number of nodes. Ideally, it also uses data that is stored locally on a particular node where the job is being executed. The computations are embarrassingly parallel because there is no communication between them. The run independent of one another.
As the name implies, MapReduce has two steps. The first step, the "Map" step, takes the input and breaks it into smaller sub-problems and distributes them to the worker nodes. The worker nodes then send their results back to the "master" node. The second step, the "Reduce" step, takes the results from the worker nodes and combines them in some manner to create the output, which is the output for the original job.
As you can tell from the description, MapReduce deals with distributed processing for both steps, but remember, the processing is designed to be embarrassingly parallel. This is where MapReduce gets performance, performing operations in parallel. To get the most performance means that there is no communication between worker nodes, so no data is really shared between them (unlike HPC applications, which are MPI-based and can potentially share massive amounts of data).
However, there could be situations where the mapping operations spawn other mapping operations, so there is some communication between them, resulting in not so embarrassingly parallelism. Typically, these don't involve too much internode communication. In addition, parallelism can be limited by the number of worker nodes that have a copy of the data. If you have five nodes needing access to the same data file, but you have only three copies, two nodes will have to pull the data from a different worker node. This results in reduced parallelism and reduced performance. This is true for both the Map phase and the Reduce phase. On the other hand, three copies of the data allows three applications to access the same data, unlike serial applications where there is only one copy of the data.
At first glance, one would think MapReduce was fairly inefficient because it must break up the problem, distribute the problem (which may be sub-divided yet again), and then assemble all of the results from the worker node to create the final answer. That seems like a great deal of work just to set up the problem and execute it. For small problems, this is definitely true -- it's faster to execute the application on a single node than to use MapReduce.
Where MapReduce shines is in parallel operations that require a great deal of computational time on the worker nodes or the assembler nodes and for large data sets. If I haven't said it clearly enough, the "magic" of MapReduce is exploiting parallelism to improve performance.
Traditionally, databases are not necessarily designed for fault-tolerance when run in a clustered situation. If you lose a node in the cluster, then you have to stop the job, check the file system and database, then restart the database on fewer nodes and rerun the application. NoSQL databases, and most of the tools like it, were primarily designed for two things: 1) performance, particularly around data, and 2) fault-tolerance. One way that some of the tools get fault-tolerance is to use Hadoop as the underlying file system. Another way to achieve fault-tolerance is to make MapReduce fault-tolerant as well.
Remember, MapReduce breaks problems into smaller sub-problems and so on. It then takes the output from these sub-problems and assembles them into the final answer. This is using parallelism to your advantage in running your application as quickly as possible. MapReduce usually adds fault-tolerance because if a task fails for some reason, then the job scheduler can reschedule the job if the data is still available. This means MapReduce can recover from the failure of a datanode (or several) and still be able to complete the job.
Many times people think of Hadoop as a pure file system that you can use as a normal file system. However, Hadoop was designed to support MapReduce from the beginning, and it's the fundamental way of interacting with the file system. Applications that interact with Hadoop can use an API, but Hadoop is really designed to use MapReduce as the primary method of interaction. The coupling of multiple data copies with the parallelism of the MapReduce produces a very scale-out, distributed and fault-tolerant solution. Just remember that the design allows for nodes to fail without interrupting the processing. This means that you can also add datanodes to the system, and Hadoop and MapReduce will take advantage of them.
Connections With R
So far the tools I've mentioned have been focused on the database portion of the problem -- gathering the data and performing some queries. This is a very important part of the Big Data process (if there is such a thing), but it's not everything. You must take the results of the queries and perform some computations, usually statistical, on them such as, what is the average age of people buying a certain product in the middle of Kansas? What was the weather like when most socks were purchased (e.g., temperature, humidity and cloudiness all being factors)? What section of a genome is the most common between people in Texas and people in Germany? Answering questions like these takes analytical computations. Moreover, much of this computation is statistical in nature (i.e., heavily math oriented).
Without much of a doubt, the most popular statistical analysis package is called R. R is really a programming language and environment. It is particularly focused at statistical analysis. To add to the previous discussion of R, it has a wide variety of built-in capabilities, including linear and non-linear modeling, a huge library of classical statistical tests, time-series analysis, classification, clustering and a number of other analysis techniques. It also has a very good graphical capability, allowing you to visualize the results. R is an interpreted language, which means that you can run it interactively or write scripts that R processes. It is also very extensible allowing you to write code in C, C++, Fortran, R itself or even Java.
For much of Big Data's existence, R has been adopted as the lingua franca for analysis, and the integration between R and database tools is a bit bumpy but getting smoother. A number of the tools mentioned in this article series have been integrated with R or have articles explaining how to get R and that tool to interact. Since this is an important topic, I have a list of links below giving a few pointers, but basically, if you Google for "R+[tool]" where [tool] is the tool you are interested in, you will likely find something.
- Column Stores
Key-Value Store/Tuple Store + R
- CouchDB + R
MongoDB + R
Terrastore + R
- Article about Teradata add-on package for R
But R isn't the only analytical tool available or used. Matlab is also a commonly used tool. There are some connections between Matlab and some of the databases. There are also some connections with SciPy, which is a scientific tool built with Python. A number of tools can also integrate with Python, so integration with SciPy is trivial.
Just a quick comment about programming languages for Big Data. If you look through a number of the tools mentioned, including Hadoop, you will see that the most common language is Java. Hadoop itself is written in Java, and a number of the database tools are either written in Java or have Java connectors. Some people view this as a benefit, while others view it as an issue. After Java, the most popular programming languages are C or C++ and Python.
All of these tools are really useful for analyzing data and can be used to convert data into information. However, one feature that is missing is good visualization tools. How do you visualize the information you create from the data? How do you visually tell which information is important and which isn't? How do you present this information easily? How do you visualize information that has more than three dimensions or three variables? These are very important topics that must be addressed in the industry.
Whether you realize it or not, visualization can have an impact on storage and data access. Do you store the information or data within the database tool or somewhere else? How can you recall the information and then process it for visualization? Questions such as these impact the design of your storage solution and its performance. Don't take storage lightly.
Transforming Information to Knowledge
Remember that our wonderful definition for Big Data involves two steps: 1) transforming data to information, and 2) transforming information to knowledge. Both steps aren't easy and can involve a great deal of computation. But how do you do these transformations? The ultimate answer lies with the individual doing the analyses and the particular field of study.
However, let's briefly touch on one possible tool that could be useful -- neural networks.
Neural networks can be used for a variety of things, but my comments are not directed at using neural networks in the more traditional ways. Rather,consider taking the data you are interested in or even the information, and training a neural network with some defined inputs and defined outputs. Hopefully, you have more than just a couple of outputs since you can many times just create a number of 2D plots to visualize the outputs as a function of the inputs (unless you have a huge number of inputs). Once you have a trained net, a very useful feature is to examine the details of the network itself. For example, examining the weights connecting the inputs to the hidden layer and from the hidden layer to the outputs can possibly tell you something about how important various inputs are to the output. Or how combinations of inputs can affect the output or outputs. This is even more useful when you have several, possibly many, outputs and you want to examine how inputs affect each of the outputs.
Neural networks could enjoy a renaissance of sorts in Big Data if they are used to help examine the information and perhaps even turn it into knowledge.
This and the previous two articles are intended to be a starting point for discussing Big Data from the top, while the first article in the series started at the bottom. But the topic of Big Data is so broad and so over-hyped that it is difficult to concisely say what Big Data is and why it is a real topic and not YABW (yet another buzzword). There are several facets to Big Data that must be carefully considered before diving in head first. Some of the facets that I have tried to discuss are:
- What is Big Data?
- Why is it important or useful?
- How do you get data into "Big Data"?
- How do you store the data?
- What tools are used in Big Data, and how can these influence storage design?
- What is Hadoop, and how can it influence storage design?
- What is MapReduce, and how does in integrate with Big Data?
Hopefully, the discussion has caused you to think and perhaps even use Big Data tools like Google to search for information and create knowledge (sorry -- had to go there). If you are asking more questions and wondering about clarification that means you have gotten what I intended from the article.
And now, back over to Henry!
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.