In May, Henry kicked off a collaborative effort to examine some of the details behind the Big Data push and what they really mean. This article is the third in our multipart series and the second of three that examine Big Data from the top of the stack — that is, the applications.
Introduction
Henry and I have undertaken the task of examining Big Data and what it really means. It’s a buzzword that, like many buzzwords, has been beaten to death, yet contains a kernel of real usefulness and technology. We’ve decided to tackle Big Data, throw it into the turnbuckle, and find the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition for Big Data.
Big Data is the process of changing data into information, which then changes into knowledge.
This definition is so appropriate because the adjective “Big” can mean many things to many fields of interest. So rather than focus on what some people may think is “Big” for their particular field, we can instead focus on what you do with the data and why.
Henry and I have chosen to tackle the discussion by coming from two different directions. Henry started at the very bottom with the hardware itself and then moved up through the stack, looking at which aspects of hardware and which technologies are important for Big Data. I'm starting at the top of the Big Data stack with the applications and then moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
Starting at the top isn’t easy, and my original article became rather lengthy, so we broke it into three parts. The first part discusses some fundamental issues at the top of the stack, including the importance of getting the data into a storage system for use. This step is actually more important than most people realize. The article also discusses the most common tools for Big Data — NoSQL databases. At the end of the first article, these tools were broken into eight groups.
- Wide Column Store/Column Families
- Document Store
- Key Value/Tuple Store
- Graph Databases
- Multimodel Databases
- Object Databases
- Multivalue Databases
- RDF Databases
In this article, the second part of my top-level series, I dive into each of these eight classes, with an eye toward what it means for data storage.
Wide Column Store/Column Families
Some people will refer to Wide Column Stores or Column Families as “BigTable Clones.” In many ways, BigTable is the grandfather (or grandmother) of Big Data NoSQL databases. It is a proprietary system that Google has built on top of its other technologies, such as MapReduce, the Google File System, the Chubby lock service, SSTable and some other bits.
The design of BigTable is fairly simple. It maps two arbitrary string values (row key and column key) and a timestamp into an associated arbitrary byte array. Think of it as a sparse, distributed, multi-dimensional sorted map, if that phrase makes things easier for you. It is designed to scale into the petabyte range across hundreds or thousands of servers, with the ability to easily add servers to scale resources (compute and storage). Consequently, the design of BigTable is focused on performance and scalability at the expense of other aspects.
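To make that data model concrete, here is a minimal sketch in Python of the (row key, column key, timestamp) to value mapping, using a plain in-memory dictionary. The function names and the web-page example are purely illustrative and are not Google's API; the real system shards this sorted map across many servers.

```python
import time

# A toy, in-memory model of BigTable's data model:
#   (row key, column key, timestamp) -> arbitrary byte string.
# The real system is a sorted map sharded across many servers; this
# sketch only illustrates the mapping itself.
table = {}

def put(row_key, column_key, value, timestamp=None):
    """Store a byte string under (row key, column key, timestamp)."""
    ts = timestamp if timestamp is not None else time.time()
    table[(row_key, column_key, ts)] = value

def get_latest(row_key, column_key):
    """Return the most recent value stored for a (row, column) pair, or None."""
    versions = [(ts, val) for (r, c, ts), val in table.items()
                if r == row_key and c == column_key]
    return max(versions)[1] if versions else None

# Hypothetical example: a crawled web page keyed by reversed domain name.
put("com.example.www", "contents:html", b"<html>...</html>")
print(get_latest("com.example.www", "contents:html"))
```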
Some of you may ask why databases are column oriented instead of the normal row orientation. Great question! Classic row-oriented databases have data in rows like the following:
| Number | Name | Height |
|---|---|---|
| 1 | Johnny | 68 |
| 2 | Lisa | 59 |
| 3 | Rick | 78 |
In essence, a CSV version of the data looks like the following:
    1,Johnny,68
    2,Lisa,59
    3,Rick,78
A column-oriented database transposes the data orientation, so that it looks like the following:
    1,2,3
    Johnny,Lisa,Rick
    68,59,78
You can see how the data has been transformed into a column-oriented database.
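A small sketch, using Python lists to stand in for the on-disk layout, shows why this transposition matters: an aggregate over a single column touches only that column's data in the column-oriented layout. The data mirrors the example table above; the layouts are simplified illustrations, not any database's actual storage format.

```python
# Row-oriented layout: each record's fields are stored together.
rows = [
    (1, "Johnny", 68),
    (2, "Lisa",   59),
    (3, "Rick",   78),
]

# Column-oriented layout: each column's values are stored together.
columns = {
    "Number": [1, 2, 3],
    "Name":   ["Johnny", "Lisa", "Rick"],
    "Height": [68, 59, 78],
}

# Averaging Height from the row store means touching every field of every row...
avg_from_rows = sum(r[2] for r in rows) / len(rows)

# ...while the column store reads only the Height column, one contiguous run of data.
avg_from_columns = sum(columns["Height"]) / len(columns["Height"])

print(avg_from_rows, avg_from_columns)  # both print 68.33...
```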
The benefit of column-oriented databases comes down to performance and to how the data access patterns map onto the hardware (hence, Henry's bottom-up approach in this series). From the Wikipedia article about column-oriented databases, the general benefits are:
- “Column-oriented databases are typically more efficient when an aggregate function needs to be computed over many rows but only for a subset of all of the columns of data because reading that smaller subset of data can be faster than reading all data”
- “Column-oriented databases are more efficient when new values of a column are supplied for all rows at once because that column data can be written efficiently and replace old column data without touching any other columns for the rows.”
The workloads for column stores are typically a fairly small number of very complex queries over similar data, which is how some Big Data workloads behave. Examples of this are data warehouses, customer relationship management (CRM) systems and library card catalogs. However, column-oriented databases are not perfect, and there are some trade-offs.
There are several examples of column-oriented databases or data stores:
Key-Value/Tuple Store
The whole idea behind key-value stores is that there is one key, one value and no duplicates. There is an unbending focus on performance. For the more computer-science-oriented readers, it's a hash table. Key-Value Stores allow the value to be more than a simple field: you can store a blob (binary object) associated with a key. However, the database itself does not understand the blob; nor does it want to — its focus is on speed, speed and more speed (and, in case you missed it, speed).
Key-Value stores, also called tuple stores, are very popular in Big Data because they are so focused on performance. If you are interested in speed or if you have very large amounts of data, this is a very popular way to access the data. However, there are some pitfalls to the approach. If you use binary objects (blobs) as the value for the key, the database knows nothing about it, forcing you to write custom code for using the returned blob. This may not be a problem because you most likely have to write custom code anyway, so what’s a little more code?
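As a sketch of just how minimal the interface is, here is a toy key-value store in Python: the store is only a hash table, and the application supplies its own encoding and decoding of the blob. The pickle format here is an arbitrary illustrative choice, not something any particular store requires.

```python
import pickle

# The store itself is nothing more than a hash table: one key, one opaque value.
store = {}

def put(key, blob):
    store[key] = blob       # the store never inspects the bytes

def get(key):
    return store.get(key)   # returns the raw blob, or None if the key is absent

# The application supplies its own encoding and decoding; this is the
# "custom code" the database knows nothing about.
record = {"name": "Johnny", "height": 68}
put("user:1", pickle.dumps(record))

blob = get("user:1")
print(pickle.loads(blob)["name"])   # prints: Johnny
```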
The emphasis on performance can put a great strain on storage systems. The value associated with a key can be large or small. If it's fairly large, then the data access patterns have some fairly reasonable streaming aspects, so the value can be read in one large read. But if the data is small, then the access patterns begin to look very random, resulting in a workload that is IOPS-driven. This impacts the architecture of the storage system.
You likely won’t find Key-Value databases with links to analysis tools. The focus is on storing and retrieving data at an extremely high rate, forcing you to write the code that processes the data. Examples of key-value databases that are used in Big Data are:
- BigTable
- LevelDB
- Couchbase Server
- Berkeley DB
- Voldemort
- Cassandra
- MemcacheDB
- Amazon DynamoDB
- Dynomite
Document Store
Document Stores are very similar to the Key-Value Stores we previously discussed, but in this case the database typically also understands the value returned for a key. The value is structured in a format the database can parse, so instead of a pure data retrieval system where you must write custom code to use the returned data, the store itself can work with the value. Moreover, this also means you can typically query the data in addition to just using the key to retrieve a blob. However, this capability is achieved at the expense of some performance and the fact that the data typically must be structured in some manner (i.e., a defined format that the database understands).
Be careful about the title “Document Store” because Document Store does not mean that the value is a .pdf file or a .docx file, although that is possible. In general, a Document refers to the value in the Key-Value, and it is assumed that the document encapsulates and encodes the data in a standard format or encoding. Examples of formats include XML, JSON, YAML and BSON, but you can also use binary formats such as .pdf, .docx, or .xls. Moreover, you can actually use a Document Storage database as you would a key-value database, if the data format is unknown to the database itself (perhaps using a proprietary data format). If you like, think of the data in the document as a row in a traditional database. The only significant difference is that it is more flexible than a row because it does not have to conform to the row schema.
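A small sketch, again in plain Python, illustrates the difference from a pure key-value store: because the value is a document in a format the store understands (JSON here), the store can answer queries against fields inside the value, and documents do not have to share a schema. The collection and the find helper are illustrative, not any particular product's API.

```python
import json

# Each value is a document in a format the store understands (JSON here),
# so the store can look inside the value, not just return it.
collection = {}

def insert(key, document):
    collection[key] = json.dumps(document)

def find(field, value):
    """Return every document whose given field equals the given value."""
    return [doc for doc in map(json.loads, collection.values())
            if doc.get(field) == value]

# Documents need not share a schema: one record has an "office" field, one does not.
insert("emp:1", {"name": "Lisa", "height": 59, "office": "B12"})
insert("emp:2", {"name": "Rick", "height": 78})

print(find("name", "Rick"))   # [{'name': 'Rick', 'height': 78}]
```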
Document Stores can be fairly easy on storage systems because a document can often be read in one large read, emphasizing the streaming nature of the data. On the other hand, if the documents are small, then the storage system starts to see data access patterns that are rather random and IOPS-driven.
Examples of Document Stores for Big Data include:
Graph Database
So far, the Big Data database tools have been all about performance, with some basic relations between data (or, in the case of Key-Value, no explicit relationships). Graph Databases go in the opposite direction and emphasize relationships among the data before all other aspects. A graph database uses the concepts from a graph structure, such as nodes, edges and properties, to represent and store data. The cool thing about graph databases is that there is no real “lookup,” per se. Rather, an element contains pointers to adjacent elements. So you begin with a node, query the data with that node or its properties, and explore the relationships to other nodes via the edges.
In graph databases, the nodes contain information you want to keep track of, and the properties are pertinent information that relates to the node. Edges are lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. These concepts can wreak havoc on storage systems. How do you access data that has connections throughout? Depending on the amount of data stored in the nodes and properties and how the edges connect the various nodes (and how much data is contained in the edges), the storage system can again think that the data access patterns are somewhat random and very IOPS-driven. But there are some situations where the data access patterns are fairly sequential when certain searches are performed (i.e., simple tracing of one node to another).
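Here is a minimal sketch of those concepts, assuming plain Python dictionaries for nodes, properties and edges. The names are made up for illustration; the point is that a query starts from a node and follows its edges to adjacent elements rather than doing a global lookup.

```python
# Nodes carry properties; edges connect nodes and carry a relationship type.
nodes = {
    "alice": {"role": "engineer"},
    "bob":   {"role": "analyst"},
    "carol": {"role": "manager"},
}
edges = {
    "alice": [("knows", "bob"), ("reports_to", "carol")],
    "bob":   [("knows", "alice")],
    "carol": [],
}

def neighbors(node, relation=None):
    """Follow the edges out of a node, optionally filtered by relationship type."""
    return [target for rel, target in edges.get(node, [])
            if relation is None or rel == relation]

# Start at a node and explore relationships by following edges, rather than
# performing a global lookup over all the data.
for person in neighbors("alice", "knows"):
    print(person, nodes[person]["role"])   # prints: bob analyst
```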
Graph databases are becoming popular in Big Data because the edges can be used to gain more information (insight) from the data. Examples of graph databases are:
- InfoGrid
- HyperGraphDB (uses Berkeley DB for data storage)
- AllegroGraph
- BigData
Object Database
Object databases are exactly what the phrase implies — a combination of a database with objects, where the objects are objects in the object-oriented programming sense. Because of this combination, they belong to a distinct group of databases. Typically, these databases integrate the database capability with object-oriented programming, so the database is usually tightly coupled with the object language, which can limit general applicability. Examples of Object databases for Big Data are the following:
Multi-model Database
This class of databases is so named because a single database can provide multiple techniques depending on what you want to achieve. For example, the database could be a Document Store, a Graph Database or both. Hence the name multi-model databases. Examples of these databases are:
In the next article in this series, I’ll present how these classes of databases can use Hadoop for data storage. I’ll also discuss how these databases, while very important, are not the end-all to Big Data. You need tools to analyze the data to create information (and sometimes knowledge).
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.