Henry kicked off our latest collaborative effort to examine some of the details behind the Big Data push and what they really mean. In my first article, I'm going to begin at the beginning, where every discussion around Big Data should begin -- at the top. What is Big Data, what are we trying to solve with it, and what tools are there? As you can imagine, this topic is huge. I will address each of these questions in its own article. This first of three parts will examine what Big Data is.
Henry Newman and I seem to talk almost weekly, except when Henry is fishing or I'm traveling. Our conversations roam all over the map, but the one topic we seem to come back to is Big Data. It's the storage equivalent of the server world's "Cloud Computing." It's the buzzword that everyone is throwing around, tacking onto their latest products (or their warmed-over products), or using to justify expenditures. The SSD and fast storage community uses it as a rallying cry for products, just as many people use it to justify cramming Hadoop into every nook and cranny ("if I just sprinkle a little Hadoop on the system, data and insight will magically appear"). The phrase has become bothersome and annoying, in part because, at the same time, people are asking what Big Data means, what it means to them, and what it takes to use the technology (if you can call a buzzword a technology). Henry and I have accepted the mission of a high-level examination of Big Data in the hopes that we can add some clarity to the confusion while learning some things ourselves, particularly how Big Data impacts storage.
I want to start off my first Big Data article with my own twist on Henry's spot-on definition of Big Data from his first article.
"Big Data is the process of changing data into information, which then changes into knowledge."
I truly think this is the essence of what Big Data is all about. It's not about the hardware. It's not about the software. And it's not specific to a particular field or vertical. This definition applies to all fields of inquiry and to all situations. It doesn't say that you have to have Petabytes of data. It doesn't say you need billions of files. It doesn't say that you must run a parallel-distributed database. It doesn't say that you are in commodity trading, bioinformatics, business analysis, nuclear physics, soap box derby car design, or examining the voting records of the Supreme Court justices. The definition is simple and yet profound -- it's all about what you do with the data and what you want to gain from it (i.e., knowledge).
However, as you have probably guessed, this topic is massive. Consequently, I've had to split this article into three parts to make it a bit more digestible. This first part sets the stage and discusses how to get "data" into "Big Data." In the next part, I will focus on the applications by breaking them into three classifications.
The second article will dive into NoSQL applications for the various classifications and point out how they are different and how they affect storage. The third article will focus on how these applications interact with Hadoop (storage) and R (analysis tool) to create information from data. It will also finish up the general topic of Big Data applications by explaining how they impact storage architectures.
So let's get started by using Henry's wonderful definition of Big Data.
Big Data Buzz vs. Big Data Reality
Contrary to popular belief, buzzwords usually develop around something real. This is absolutely true with Big Data. Big Data really started with the Web 2.0 companies that started indexing the entire Web and allowed people to search it. As part of this, companies needed to manage extremely large amounts of data and develop ways to use the data to create information. Think of things like Google's PageRank as a way of taking large amounts of data and creating information from it. In turn, people who used Google turned that information into knowledge.
There are some subtle aspects to this that are important and usually get overlooked. One of them is that Big Data has access to data, most likely more than ever. This usually leads to the use of the name Big Data. In the case of web information indexing, you had access to the data on the web. You can think of this as "sensor data." That is, data from some sort of measurements or instrumentation. Sensor data for the web is just the web data itself -- web pages, links, and so on. However, the definition of Big Data that I'm using doesn't say anything about the amount of data -- just that it has access to data. What is also subtly missed by many people is that the amount of data is much larger than what has been used before, but there is no actual number above which it is considered Big Data. For example, I work in HPC and we're used to talking about Petabytes of data, and millions, if not billions, of files. However, I was talking with an expert in the field of "smart grid" who was talking about the problems of Big Data. When I asked her how they defined Big Data, she answered that it was about 40TB and about 1 million files. To the field of smart grid, this is Big Data because it is much, much larger than the amount of data they are used to working with, but it is much smaller than what the HPC field in general deals with. Moreover, the smart grid community is trying to take the data and turn it into information that can be used. This reinforces the point that the amount of data is not the point of Big Data.
Other disciplines have their own "data sensors." For bioinformatics and the computational biology fields, there are genome sequencers, MRI images, x-rays, blood tests, and a range of other tests. For businesses, it can be something as simple as POS (Point of Sale) information coupled with customer information via frequent purchaser cards. But it can also include information about the state of the world such as weather (local and global), economic information, traffic information, sporting and other community events, demographics, television programming, and other sources of information that describe the "state" of the world around the business. All of this is "sensor data" and Big Data is really about taking that data and first turning it into information.
Here we are with data coming in fast and furious from various disciplines and from many sensors with the supposition that maybe there is some useful information in there that can be used to gain knowledge. Some people are even saying that the old Scientific Method, where you put forth a hypothesis and try to prove or disprove it, has been replaced with data-driven science, where you explore the data using tools to make discoveries or insights (knowledge). I'm not prepared to go that far just yet, but there is something interesting about exploring data using sophisticated tools (reminds me of the end of Carl Sagan's novel, Contact, where the protagonist is searching for meaning in a data stream). All of this is what has led people to create tools for examining large (whatever large means to you), possibly distributed, unstructured, possibly unrelated, and quickly growing pools of data to try to create information and ultimately knowledge.
One of the fundamental steps in Henry's definition of Big Data is changing data into information. This is what I will be discussing in this article. What applications or classes of application can we use for converting data into information? How does this impact storage? I will be listing a number of applications and tools along with links to them as well as articles and discussion, but by no means is this an exhaustive list and the intent of this article is not to be a survey article with a bunch of links. Hopefully I'll turn this bunch of "data" into information and then coupled with Henry's articles we'll create some knowledge.
Big Data's Beginnings
Before jumping into a list of applications that can possibly be used for Big Data, we should take a step back and talk about how we "feed the beast." That is, how do we access the data and how do we store it? (i.e., file formats) I consider this to be a crucial part of examining the "top-end" of Big Data.
Getting data into the system we're using for analysis is not as easy as it seems. In many cases, the data is taken from disparate sources that have their own file format or their own storage system. Moreover, some data sources create data at a very fast pace. An example of this is video monitoring that streams data from cameras to storage. Today, virtually every store, gas station, Starbucks, street corner and highway has some sort of video camera. This data must be collected and stored for future analysis. Before you say that much of this data is pointless, you might be surprised.
Remember the tornadoes in spring 2012? Just a few days after they passed, videos from gas station cameras and truck parking lots started showing up on television. Making sensational videos for the nightly news was not the point of these videos. Rather, they can be used to track the path of the tornado, which is useful information for people who model tornadoes near the ground as well as for storm trackers who want to know where exactly the tornado touched down. It's also useful information for insurance companies so they know precisely where the tornado touched down. And this information is also useful for the local governments because the tornado path information lets them look for homes or businesses that might have been hit. This enables them to look for people, broken gas lines, downed power lines and cars that might have been moving during the tornado. If you are interested in "Big Data" and creating information and knowledge from data, then you need systems that can quickly gather this data and write it to storage somewhere for quick analysis and retrieval.
This discussion about ingesting data from "sensors" into storage brings up a question: Do we leave the data where it is and access the data remotely, or do we pull the data into a centralized system and then analyze it? There are benefits and difficulties with each approach, so you must carefully examine the details of each approach. Deciding where to put the data is up to you, but you have to realize that there are implications in leaving it in disparate locations.
Issues to think about include:
- Data access methods? (e.g., REST, WebDAV and NFS)
- Data permissions? (Who gets access to the data?)
- Data security
- Performance (e.g., bandwidth and latency)
- Data access priority? (local access trumps remote access?)
- Data integrity?
- Management of the data?
- Responsibility for the data?
These are what you might think of as "basic" issues because we haven't even discussed the data itself. However, in the brave new world of "cloud" everything, these are issues you will have to address.
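The performance item lends itself to a quick back-of-the-envelope check before deciding whether to centralize data. The sketch below estimates how long a bulk copy would take. The 40TB figure is borrowed from the smart grid example earlier; the link speeds, the decimal definition of a terabyte, and the `efficiency` factor are assumptions for illustration only, and latency and protocol overhead are ignored.

```python
# Rough estimate of bulk transfer time. All inputs are illustrative assumptions.

TB = 10**12      # decimal terabyte, in bytes (assumption)
GBIT = 10**9     # one gigabit per second, in bits per second


def transfer_time_hours(size_bytes: float,
                        link_bits_per_sec: float,
                        efficiency: float = 1.0) -> float:
    """Hours needed to move size_bytes over a link.

    efficiency is the fraction of the nominal link rate actually achieved
    (1.0 means the full rate, which real networks rarely deliver).
    """
    if link_bits_per_sec <= 0:
        raise ValueError("link speed must be positive")
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in the range (0, 1]")
    seconds = (size_bytes * 8) / (link_bits_per_sec * efficiency)
    return seconds / 3600.0


if __name__ == "__main__":
    for gbits in (1, 10):
        hours = transfer_time_hours(40 * TB, gbits * GBIT)
        print(f"40 TB over {gbits} Gb/s: {hours:.1f} hours")
```

At a full 1 Gb/s, 40TB works out to roughly 89 hours, a bit under four days, which is the kind of number that can tip the decision toward leaving data where it is.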
I don't want to dive into these topics in any depth except for one -- data access methods. This is important because it deals with the design of the application, which leads to the design of the storage system and the network (among other things). There are several interfaces for accessing information on the Internet, but the one that seems to have become the latest standard is called REST (Representational State Transfer). You can use a REST interface to pull data from a disparate server to your local server if you wish to do that. But you can also use REST as an interface to perform some function on the remote server on behalf of the client, which is your server. Defining the exact details of the interface and how it works is one of THE critical steps in creating a reasonably useful environment for accessing data from disparate resources.
A companion to a REST interface is something called WebDAV (Web-based Distributed Authoring and Versioning). WebDAV is a child of the Web 2.0 movement. It allows users to collaboratively edit and manage files on disparate web servers. You can write applications using the WebDAV interface to access data on different servers. But one of the coolest features of WebDAV is that it maintains properties such as the author and modification date, and it supports namespace management, collections and overwrite protection, all of which are very useful in Big Data.
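As a rough illustration of pulling data over a REST interface, here is a minimal sketch using only Python's standard library. The host `data.example.org`, the `readings` resource, and the `station` and `since` query parameters are all hypothetical; a real service defines its own resources and parameters, which is exactly the interface design work described above.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import Request, urlopen


def build_url(base_url: str, resource: str, **params: str) -> str:
    """Join a base URL and a resource path, adding sorted query parameters."""
    url = urljoin(base_url.rstrip("/") + "/", resource.lstrip("/"))
    if params:
        url += "?" + urlencode(sorted(params.items()))
    return url


def fetch_json(url: str, opener=urlopen, timeout: float = 10.0):
    """Issue a GET request and decode the JSON body.

    opener defaults to urllib's urlopen; it is a parameter so that a
    stand-in can be supplied when no network is available.
    """
    request = Request(url, headers={"Accept": "application/json"}, method="GET")
    with opener(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage (the endpoint does not exist):
# url = build_url("https://data.example.org/api", "readings",
#                 station="42", since="2012-04-01")
# readings = fetch_json(url)
```

Keeping URL construction separate from the network call makes it easier to see, and to test, exactly what the client asks the remote server for.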
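To give a feel for how WebDAV exposes properties, the sketch below builds a PROPFIND request and parses a multistatus reply using Python's standard library. PROPFIND, the `Depth` header, and the `getlastmodified` and `getcontentlength` properties come from the WebDAV specification; the server URL and file names in the comments are hypothetical, and a production client would also check the per-property status codes, which this sketch skips.

```python
import xml.etree.ElementTree as ET
from urllib.request import Request

# Request body asking for two standard WebDAV properties.
PROPFIND_BODY = (
    b'<?xml version="1.0" encoding="utf-8"?>'
    b'<D:propfind xmlns:D="DAV:"><D:prop>'
    b'<D:getlastmodified/><D:getcontentlength/>'
    b'</D:prop></D:propfind>'
)


def dav(tag: str) -> str:
    """Qualify a tag name with the WebDAV XML namespace ("DAV:")."""
    return "{DAV:}" + tag


def build_propfind(url: str, depth: str = "0") -> Request:
    """Build (but do not send) a PROPFIND request for the given resource."""
    return Request(
        url,
        data=PROPFIND_BODY,
        method="PROPFIND",
        headers={"Depth": depth,
                 "Content-Type": "application/xml; charset=utf-8"},
    )


def parse_multistatus(xml_bytes: bytes) -> dict:
    """Map each href in a multistatus reply to its returned properties."""
    results = {}
    root = ET.fromstring(xml_bytes)
    for response in root.findall(dav("response")):
        href = response.findtext(dav("href"))
        props = {}
        for propstat in response.findall(dav("propstat")):
            prop = propstat.find(dav("prop"))
            if prop is None:
                continue
            for child in prop:
                name = child.tag.split("}", 1)[-1]  # strip the namespace
                props[name] = (child.text or "").strip()
        results[href] = props
    return results


# Hypothetical usage (the server does not exist):
# request = build_propfind("https://dav.example.org/data/run1.csv")
```

The request would be sent with `urllib.request.urlopen(request)` and the reply body handed to `parse_multistatus`.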
After considering how to access the data, primarily to get it into our storage system, we really need to think about how the data is actually stored. For example, is the data stored in a flat text file and if so, how is the data organized and how do you access it? Is the data stored in some sort of database? Is the data stored in some standard format such as HDF5? Or is the data stored in a proprietary format? Answering these questions is more critical than people think because you will have to write methods to get the data you want and retrieve it in virtually any tool. In many cases, the choice of how and where the data is written and stored depends on the actual applications being developed and how the results are created and stored. But be sure to pay close attention to this aspect of Big Data -- in what form will the data be stored?
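The need to "write methods to get the data you want" shows up as soon as two sources disagree on format. Here is a minimal sketch, assuming two hypothetical sensor feeds: one delivers CSV with `sensor_id`, `timestamp` and `value` columns, the other delivers one JSON object per line with `id`, `ts` and `val` fields. Both layouts and the `Reading` record are made up for illustration.

```python
import csv
import io
import json
from typing import Iterator, NamedTuple


class Reading(NamedTuple):
    """Common record that every source is converted into."""
    sensor_id: str
    timestamp: str
    value: float


def read_csv(text: str) -> Iterator[Reading]:
    """Parse the (hypothetical) CSV feed: sensor_id,timestamp,value."""
    for row in csv.DictReader(io.StringIO(text)):
        yield Reading(row["sensor_id"], row["timestamp"], float(row["value"]))


def read_json_lines(text: str) -> Iterator[Reading]:
    """Parse the (hypothetical) JSON-per-line feed with id/ts/val fields."""
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            obj = json.loads(line)
            yield Reading(str(obj["id"]), obj["ts"], float(obj["val"]))


if __name__ == "__main__":
    csv_text = "sensor_id,timestamp,value\ns1,2012-04-02T10:00:00,3.5\n"
    json_text = '{"id": "s2", "ts": "2012-04-02T10:00:05", "val": 4}\n'
    for reading in list(read_csv(csv_text)) + list(read_json_lines(json_text)):
        print(reading)
```

Every additional format means another reader like these, which is why the storage format decision is worth making early.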
Now What Do You Do?
Some people reading this article will complain that I haven't said word one about "Big Data Applications" or "Big Data Analytics," and there is good reason for that -- there are many varieties and types including open-source and proprietary. I divide up the applications into three classes:
- Custom code
- Analytics oriented (e.g. "R")
- Database based on the concepts of NoSQL
I can't say too much about the first class of applications because the entire thing is custom code. Custom code to get data into and out of storage, custom storage, custom APIs and custom applications. But one thing that I can say about this class that also applies to the other two classes is that the applications are frequently written in Java.
The second class of applications uses heavily mathematical or statistically oriented analysis tools almost exclusively. Since Big Data is about converting data into information, statistics are usually employed to make sense of the data and create some general information about it. One of the most common tools in this regard is called R. R is really a programming language based on the S language, and its open-source implementation is what many people use for statistical analysis. It is an extremely powerful language and tool with a very large number of add-ons including visualization tools. There are also efforts to add parallel capabilities to R, called R-Parallel, to allow it to scale in terms of computational capability. There are also a number of efforts at integration of R with Hadoop (more on that later in the article). But this class of applications stores the data in R-readable forms and also allows access to the data using R methods.
You will find that as we go through some classes of applications, R is integrated with quite a few of them. There are other languages as well, such as Matlab and even SciPy, but R gives you the ability to do very sophisticated statistical analysis of the data.
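R is the tool this class of applications revolves around. Purely to illustrate the kind of summary statistics that turn raw data into general information, here is a minimal sketch using Python's standard `statistics` module instead. The daily sales figures are made up.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Reduce a list of raw measurements to a handful of summary numbers."""
    if len(values) < 2:
        raise ValueError("need at least two values to compute a spread")
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


if __name__ == "__main__":
    daily_sales = [1020.0, 980.0, 1110.0, 1500.0, 990.0]  # made-up data
    for name, value in summarize(daily_sales).items():
        print(f"{name}: {value}")
```

A real analysis would go well beyond this, but even a mean and a spread are a first step from "data" toward "information."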
The third class, NoSQL databases, is definitely the largest class with many options for tools depending upon the data and the relationships within the data or databases. The term "NoSQL" means that the databases don't adhere to the common conventions of traditional relational databases. In general:
- They don't use SQL (Structured Query Language) as their core language
- They don't give full ACID guarantees
- They have a distributed, fault-tolerant, and scalable architecture
As with many of the concepts in Big Data, NoSQL grew out of the Web 2.0 era. People working with data saw the huge explosion in the quantity of data and, perhaps more importantly, the need to analyze the data to create information and hopefully knowledge. Due to the perception that the amount of data was increasing rapidly, a number of developers decided to focus on speed of data retrieval and data append operations rather than the other classical aspects of databases. As part of the design they dropped many of the features that people expected in an RDBMS. Instead, they wanted to focus on aspects critical to their work like storing millions (or more) of key-value pairs in a few simple associative arrays and then retrieving them for statistical analysis or general processing. Additionally, they wanted to be able to add to the arrays as more data came in from "data sensors." The same is true for storing millions of data records and performing similar operations. But the driving focus is storing and retrieving huge amounts of data for processing and not necessarily focusing on the relationships between the data or other aspects that might reduce the performance.
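The "simple associative array with fast appends" idea can be sketched in a few lines. This is an in-memory toy, not how any particular NoSQL product is implemented; the class name `KeyValueStore` and its methods are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


class KeyValueStore:
    """Toy append-oriented key-value store backed by a Python dict."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Any]] = defaultdict(list)

    def append(self, key: str, value: Any) -> None:
        """Add a new value under key as more sensor data arrives."""
        self._data[key].append(value)

    def get(self, key: str) -> List[Any]:
        """Return a copy of everything stored under key (empty if unknown)."""
        return list(self._data.get(key, []))


if __name__ == "__main__":
    store = KeyValueStore()
    store.append("sensor-1", 3.5)
    store.append("sensor-1", 4.0)
    print(store.get("sensor-1"))
```

Note what is absent: no schema, no joins, no transactions. Those omissions are the trade the paragraph above describes.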
NoSQL databases also focused on being scalable so they are, by design, distributed. The data is typically stored redundantly on several servers so the loss of a server can be tolerated. It also means that the database can be scaled just by adding more servers or storage. If you need more processing capability, you just add more servers. If you need more storage, you can either add more storage to the existing servers or you can add more servers.
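To show why storing each item on several servers lets the loss of one server be tolerated, here is a minimal in-memory sketch. The placement rule (hash the key, then use the next servers in a fixed list) and the class `ReplicatedStore` are assumptions for illustration; real systems typically use more careful schemes such as consistent hashing so that adding a server does not reshuffle most keys.

```python
import hashlib
from typing import Any, Dict, List, Sequence, Set


class ReplicatedStore:
    """Toy store that writes each key to `replicas` simulated servers."""

    def __init__(self, server_names: Sequence[str], replicas: int = 2) -> None:
        if not 1 <= replicas <= len(server_names):
            raise ValueError("replicas must be between 1 and the server count")
        self.names: List[str] = list(server_names)
        self.servers: Dict[str, Dict[str, Any]] = {n: {} for n in self.names}
        self.replicas = replicas
        self.down: Set[str] = set()

    def _owners(self, key: str) -> List[str]:
        """Pick the servers responsible for key from a hash of the key."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.names)
        return [self.names[(start + i) % len(self.names)]
                for i in range(self.replicas)]

    def put(self, key: str, value: Any) -> None:
        """Write to every owner that is currently up."""
        live = [n for n in self._owners(key) if n not in self.down]
        if not live:
            raise RuntimeError("no live replica available for this key")
        for name in live:
            self.servers[name][key] = value

    def get(self, key: str) -> Any:
        """Read from the first live owner that holds the key."""
        for name in self._owners(key):
            if name not in self.down and key in self.servers[name]:
                return self.servers[name][key]
        raise KeyError(key)

    def fail(self, name: str) -> None:
        """Simulate the loss of a server."""
        self.down.add(name)
```

With two replicas, a read still succeeds after one of a key's two servers fails; losing both makes the key unavailable, which is the basic trade-off behind choosing a replica count.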
There are several ways to classify NoSQL databases, and in future articles, I will dive into these eight NoSQL groupings and discuss why people use them and how they might impact data storage.
- Wide Column Store/Column Families
- Document Store
- Key Value/Tuple Store
- Graph Databases
- Multimodel Databases
- Object Databases
- Multivalue databases?
- RDF databases?
The variety of classes illustrates the creativity and wide range of fields where NoSQL databases are being used. In the subsequent articles, I'll briefly discuss each one because, believe it or not, the design of these applications can have a big impact on the design of the storage.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.