The Top of the Big Data Stack: Applications Classification and Data Storage
Henry kicked off our latest collaborative effort to examine some of the details behind the Big Data push and what they really mean. In my first article, I'm going to begin at the beginning, where every discussion around Big Data should begin -- at the top. What is Big Data, what are we trying to solve with it, and what tools are there? As you can imagine, this topic is huge, so I will address each of these questions in its own article. This first of three parts examines what Big Data is.
Henry Newman and I seem to talk almost weekly, except when Henry is fishing or I'm traveling. Our conversations roam all over the map, but the one topic we seem to come back to is Big Data. It's the storage equivalent of the server world's "Cloud Computing." It's the buzzword that everyone is throwing around, tacking onto their latest products (or their warmed-over products), or using to justify expenditures. The SSD and fast storage community uses it as a rallying cry for products, just as many people use it to justify cramming Hadoop into every nook and cranny ("if I just sprinkle a little Hadoop on the system, data and insight will magically appear"). The phrase has become bothersome and annoying, in part because people are still asking what Big Data means, what it means to them, and what it takes to use the technology (if you can call a buzzword a technology). Henry and I have accepted the mission of a high-level examination of Big Data in the hope that we can bring some clarity to the confusion while learning some things ourselves, particularly how Big Data impacts storage.
I want to start off my first Big Data article with my own twist on Henry's spot-on definition of Big Data from his first article.
"Big Data is the process of changing data into information, which then changes into knowledge."
I truly think this is the essence of what Big Data is all about. It's not about the hardware. It's not about the software. And it's not specific to any one field or vertical. This definition applies to all fields of inquiry and to all situations. It doesn't say that you have to have Petabytes of data. It doesn't say you need billions of files. It doesn't say that you must run a parallel-distributed database. It doesn't say that you are in commodity trading, bioinformatics, business analysis, nuclear physics, soap box derby car design, or examining the voting records of the Supreme Court justices. The definition is simple and yet profound -- it's all about what you do with the data and what you want to gain from it (i.e., knowledge).
However, as you have probably guessed, this topic is massive. Consequently, I've had to split this article into three parts to make it a bit more digestible. This first part sets the stage and discusses how to get "data" into "Big Data." In the next part, I will focus on the applications by breaking them into three classifications.
The second article will dive into NoSQL applications for the various classifications and point out how they differ and how they affect storage. The third article will focus on how these applications interact with Hadoop (storage) and R (analysis tool) to create information from data. It will also finish up the general topic of Big Data applications by explaining how they impact storage architectures.
So let's get started by using Henry's wonderful definition of Big Data.
Big Data Buzz vs. Big Data Reality
Contrary to popular belief, buzzwords usually develop around something real. This is absolutely true of Big Data. Big Data really started with the Web 2.0 companies that began indexing the entire Web and allowed people to search it. As part of this, companies needed to manage extremely large amounts of data and develop ways to use the data to create information. Think of things like Google's PageRank as a way of taking large amounts of data and creating information from it. In turn, people who used Google turned that information into knowledge.
There are some subtle aspects to this that are important but usually get overlooked. One of them is that Big Data has access to data, most likely more than ever before. This is usually what leads to the name Big Data. In the case of web indexing, you had access to the data on the web. You can think of this as "sensor data," that is, data from some sort of measurement or instrumentation. Sensor data for the web is just the web data itself -- web pages, links, and so on. However, the definition of Big Data that I'm using doesn't say anything about the amount of data -- just that it has access to data. What many people also miss is that while the amount of data is much larger than what has been used before, there is no specific number above which it is considered Big Data. For example, I work in HPC, and we're used to talking about Petabytes of data and millions, if not billions, of files. However, I was talking with an expert in the field of "smart grid" about the problems of Big Data. When I asked her how they defined Big Data, she answered that it was about 40TB and about 1 million files. To the field of smart grid, this is Big Data because it is much, much larger than the amount of data they are used to working with, even though it is much smaller than what is typical in the HPC field. Moreover, the smart grid community is trying to take the data and turn it into information that can be used. This reinforces the point that the amount of data is not the point of Big Data.
Other disciplines have their own "data sensors." For bioinformatics and the computational biology fields, there are genome sequencers, MRI images, x-rays, blood tests, and a range of other tests. For businesses, it can be something as simple as POS (Point of Sale) information coupled with customer information via frequent purchaser cards. But it can also include information about the state of the world such as weather (local and global), economic information, traffic information, sporting and other community events, demographics, television programming, and other sources of information that describe the "state" of the world around the business. All of this is "sensor data" and Big Data is really about taking that data and first turning it into information.
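As a toy illustration of turning "sensor data" into information, here is a minimal Python sketch. Everything in it is hypothetical and made up for illustration: the records, the dates, the item names, and the function `units_by_weather` do not come from any real system. It simply combines point-of-sale records with weather data for the same days and totals the units sold under each weather condition.

```python
from collections import defaultdict

# Hypothetical point-of-sale records: (date, item, units sold).
pos_records = [
    ("2012-06-01", "umbrella", 3),
    ("2012-06-01", "sunscreen", 12),
    ("2012-06-02", "umbrella", 15),
    ("2012-06-02", "sunscreen", 2),
    ("2012-06-03", "umbrella", 11),
    ("2012-06-03", "sunscreen", 4),
]

# Hypothetical "state of the world" data: weather on each date.
weather_by_date = {
    "2012-06-01": "sunny",
    "2012-06-02": "rain",
    "2012-06-03": "rain",
}


def units_by_weather(records, weather):
    """Total the units sold for each (weather, item) pair."""
    totals = defaultdict(int)
    for date, item, units in records:
        totals[(weather[date], item)] += units
    return dict(totals)


summary = units_by_weather(pos_records, weather_by_date)
for (conditions, item), units in sorted(summary.items()):
    print(f"{conditions:>5}  {item:<10} {units}")
```

The raw records are the data; the per-weather totals are a first, very small piece of information. Deciding what to stock when rain is forecast would be the knowledge step, and that part is left to the person reading the summary.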
Here we are with data coming in fast and furious from various disciplines and from many sensors, with the supposition that maybe there is some useful information in there that can be used to gain knowledge. Some people are even saying that the old Scientific Method, where you put forth a hypothesis and try to prove or disprove it, has been replaced with data-driven science, where you explore the data using tools to make discoveries or gain insights (knowledge). I'm not prepared to go that far just yet, but there is something interesting about exploring data using sophisticated tools (it reminds me of the end of Carl Sagan's novel, Contact, where the protagonist is searching for meaning in a data stream). All of this is what has led people to create tools for examining large (whatever large means to you), possibly distributed, unstructured, possibly unrelated, and quickly growing pools of data to try to create information and ultimately knowledge.
One of the fundamental steps in Henry's definition of Big Data is changing data into information. This is what I will be discussing in this article. What applications or classes of applications can we use for converting data into information? How does this impact storage? I will be listing a number of applications and tools along with links to them, as well as articles and discussions, but this is by no means an exhaustive list, and the intent of this article is not to be a survey with a bunch of links. Hopefully I'll turn this bunch of "data" into information, and then, coupled with Henry's articles, we'll create some knowledge.