Next Steps - Page 3
Now What Do You Do?
Some people reading this article will complain that I haven't said word one about "Big Data Applications" or "Big Data Analytics," and there is good reason for that -- there are many varieties and types including open-source and proprietary. I divide up the applications into three classes:
- Custom code
- Analytics oriented (e.g. "R")
- Databases based on the concepts of NoSQL
I can't say too much about the first class of applications because the entire thing is custom code: custom code to get data into and out of storage, custom storage, custom APIs and custom applications. But one thing I can say about this class, which also applies to the other two classes, is that the applications are frequently written in Java.
The second class of applications relies almost exclusively on heavy mathematical or statistical analysis tools. Since Big Data is about converting data into information, statistics are usually employed to make sense of the data and create some general information about it. One of the most common tools in this regard is called R. R is really a programming language based on the S language, and its open-source implementation is widely used for statistical analysis. It is an extremely powerful language and tool with a very large number of add-ons, including visualization tools. There are efforts to add parallel capabilities to R, called R-Parallel, so that it can scale in terms of computational capability. There are also a number of efforts to integrate R with Hadoop (more on that later in the article). This class of applications stores the data in R-readable forms and also allows access to the data using R methods.
You will find that as we go through some classes of applications, R is integrated with quite a few of them. There are other languages as well, such as Matlab and even SciPy, but R gives you the ability to do very sophisticated statistical analysis of the data.
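To make "converting data into information" a little more concrete, here is a minimal sketch in Python (not R), using only the standard library. The sensor readings and the `summarize` helper are made up for illustration; in R, a comparable first look at the data would typically come from a call such as `summary()`.

```python
import statistics

def summarize(readings):
    """Reduce a list of raw numeric readings to a few descriptive statistics."""
    return {
        "count": len(readings),
        "mean": statistics.mean(readings),
        "median": statistics.median(readings),
        "stdev": statistics.stdev(readings),  # sample standard deviation
    }

# Hypothetical raw data: temperature readings from a single sensor.
readings = [20.0, 22.0, 21.0, 23.0, 24.0]
info = summarize(readings)
print(info)
```

The point is only that a pile of raw values becomes a handful of numbers a person can reason about; real analyses go far beyond this.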
The third class, NoSQL databases, is definitely the largest class, with many options for tools depending upon the data and the relationships within the data or databases. The term "NoSQL" means that the databases don't adhere to the usual conventions of traditional relational databases. In general:
- They don't use SQL (Structured Query Language) as their core language
- They don't give full ACID (atomicity, consistency, isolation, durability) guarantees
- They have a distributed, fault-tolerant, and scalable architecture
As with many of the concepts in Big Data, NoSQL grew out of the Web 2.0 era. People working with data saw the huge explosion in the quantity of data and, perhaps more importantly, the need to analyze the data to create information and hopefully knowledge. Due to the perception that the amount of data was increasing rapidly, a number of developers decided to focus on the speed of data retrieval and data append operations rather than the other classical aspects of databases. As part of the design they dropped many of the features that people expected in an RDBMS. Instead, they wanted to focus on aspects critical to their work, like storing millions (or more) of key-value pairs in a few simple associative arrays and then retrieving them for statistical analysis or general processing. Additionally, they wanted to be able to add to the arrays as more data came in from "data sensors." The same holds for storing millions of data records and performing similar operations on them. The driving focus is storing and retrieving huge amounts of data for processing, not the relationships between the data or other aspects that might reduce performance.
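The key-value idea described above can be sketched in a few lines of Python. This is a toy in-memory illustration under stated assumptions, not how any particular NoSQL product is implemented; the `KeyValueStore` class, the sensor IDs and the readings are all hypothetical.

```python
from collections import defaultdict

class KeyValueStore:
    """Toy in-memory key-value store: each key maps to an append-only list."""

    def __init__(self):
        self._data = defaultdict(list)

    def append(self, key, value):
        """Add a new value under key; existing values are never rewritten."""
        self._data[key].append(value)

    def get(self, key):
        """Return a copy of all values stored under key (empty list if none)."""
        return list(self._data.get(key, []))

store = KeyValueStore()

# Hypothetical "data sensor" feed: (sensor id, reading) pairs arriving over time.
for sensor_id, reading in [("s1", 20.5), ("s2", 18.0), ("s1", 21.0), ("s1", 22.0)]:
    store.append(sensor_id, reading)

# Retrieve everything for one key and hand it to some analysis step.
s1 = store.get("s1")
average_s1 = sum(s1) / len(s1)
print(s1, average_s1)
```

Notice what is missing: no schema, no joins, no transactions. Appends and lookups by key are the only operations, which is exactly the trade the paragraph above describes.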
NoSQL databases also focused on being scalable so they are, by design, distributed. The data is typically stored redundantly on several servers so the loss of a server can be tolerated. It also means that the database can be scaled just by adding more servers or storage. If you need more processing capability, you just add more servers. If you need more storage, you can either add more storage to the existing servers or you can add more servers.
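As a rough sketch of redundant storage across servers, the Python below places every key on two of three simulated servers and shows that a read still succeeds after one of them is lost. Everything here is an assumption made for illustration: the `Cluster` class, the server names, and the simple hash-and-wrap placement rule are not taken from any real database.

```python
import hashlib

class Cluster:
    """Toy cluster: each key is stored on `replicas` distinct servers."""

    def __init__(self, server_names, replicas=2):
        if replicas > len(server_names):
            raise ValueError("need at least as many servers as replicas")
        self.replicas = replicas
        self.servers = {name: {} for name in server_names}  # name -> local data
        self.down = set()  # names of servers currently considered lost

    def _placement(self, key):
        """Pick `replicas` consecutive servers, starting from a hash of the key."""
        names = sorted(self.servers)
        start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(names)
        return [names[(start + i) % len(names)] for i in range(self.replicas)]

    def put(self, key, value):
        """Write the value to every server chosen for this key."""
        for name in self._placement(key):
            self.servers[name][key] = value

    def get(self, key):
        """Read from the first chosen server that is still up and has the key."""
        for name in self._placement(key):
            if name not in self.down and key in self.servers[name]:
                return self.servers[name][key]
        raise KeyError(key)

    def fail(self, name):
        """Simulate the loss of one server."""
        self.down.add(name)

cluster = Cluster(["server-a", "server-b", "server-c"], replicas=2)
cluster.put("sensor-1", 20.5)

holders = cluster._placement("sensor-1")
cluster.fail(holders[0])         # lose the first server holding the key
value = cluster.get("sensor-1")  # still served by the second copy
print(holders, value)
```

The sketch ignores what happens when servers are added: with this placement rule, changing the server count moves most keys, so a real system needs some way to rebalance data (consistent hashing is one well-known approach to limiting that movement).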
There are several ways to classify NoSQL databases. In future articles, I will dive into the following eight NoSQL groupings and discuss why people use them and how they might impact data storage:
- Wide Column Store/Column Families
- Document Store
- Key Value/Tuple Store
- Graph Databases
- Multimodel Databases
- Object Databases
- Multivalue Databases
- RDF Databases
The variety of classes illustrates the creativity and wide range of fields where NoSQL databases are being used. In the subsequent articles, I'll briefly discuss each one because, believe it or not, the design of these applications can have a big impact on the design of the storage.
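To give a feel for how differently some of these groupings shape data, here is one made-up fact ("sensor s1 in room r7 read 20.5") written as plain Python structures in four of the styles listed above. The field names and layouts are illustrative assumptions only; real products in each group have their own formats and APIs.

```python
# Key-value: an opaque value looked up by a single key.
key_value = {"sensor:s1:latest": "20.5"}

# Document: a self-describing, possibly nested record.
document = {"_id": "s1", "room": "r7", "readings": [{"t": 0, "value": 20.5}]}

# Wide column: row key -> column family -> column -> value.
wide_column = {"s1": {"meta": {"room": "r7"}, "readings": {"t0": 20.5}}}

# Graph: nodes plus labeled edges between them.
graph = {
    "nodes": {"s1": {"type": "sensor"}, "r7": {"type": "room"}},
    "edges": [("s1", "LOCATED_IN", "r7")],
}

print(wide_column["s1"]["readings"]["t0"])
```

Each shape favors a different access pattern (lookup by key, fetch a whole record, scan a column, walk relationships), which is part of why the choice can matter for the storage underneath.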
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.