More Groups of Tools - Page 2
The whole idea behind key-value stores is that there is one key, one value and no duplicates. There is an unbending focus on performance. For the more computer-science-oriented readers, it's a hash table. Key-Value stores allow you to store more than a single value with a key. You can also store a blob (binary object) associated with a key. However, the database itself does not understand the blob, nor does it want to -- its focus is on speed, speed and more speed (and, in case you missed it, speed).
Key-Value stores, also called tuple stores, are very popular in Big Data because they are so focused on performance. If you are interested in speed or if you have very large amounts of data, this is a very popular way to access the data. However, there are some pitfalls to the approach. If you use binary objects (blobs) as the values for the keys, the database knows nothing about their contents, forcing you to write custom code to use the returned blob. This may not be a problem because you most likely have to write custom code anyway, so what's a little more code?
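To make the blob point concrete, here is a minimal sketch of a key-value store built on a Python dict (itself a hash table). The `KeyValueStore` class, the `sensor:42` key and the 12-byte record layout are all hypothetical, chosen only for illustration; they do not describe any particular product. The store hands back raw bytes, and only the application code knows how to decode them.

```python
import struct


class KeyValueStore:
    """Toy in-memory key-value store: one key, one opaque value."""

    def __init__(self):
        self._table = {}  # a Python dict is a hash table

    def put(self, key: str, value: bytes) -> None:
        # Writing to an existing key overwrites it, so keys never duplicate.
        self._table[key] = value

    def get(self, key: str) -> bytes:
        # The store returns the bytes as-is; it never inspects them.
        return self._table[key]


# Hypothetical record layout known only to the application:
# a 4-byte unsigned sensor id followed by an 8-byte float reading.
RECORD = struct.Struct("<Id")


def encode_reading(sensor_id: int, reading: float) -> bytes:
    return RECORD.pack(sensor_id, reading)


def decode_reading(blob: bytes):
    # This is the "custom code" the application has to supply.
    return RECORD.unpack(blob)


store = KeyValueStore()
store.put("sensor:42", encode_reading(42, 21.5))

blob = store.get("sensor:42")
sensor_id, reading = decode_reading(blob)
print(len(blob), sensor_id, reading)  # 12 42 21.5
```

The store itself stays trivially simple, which is the point: all knowledge of the value's structure lives in `encode_reading` and `decode_reading`, outside the database.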
The emphasis on performance can put a great strain on storage systems. The value associated with a key can be large or small. If it's fairly large, then the data access pattern has a reasonable streaming aspect, so the value can be fetched in one large read. But if the values are small, then the data access patterns begin to look very random and become IOPS-driven. This impacts the architecture of the storage system.
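A rough back-of-the-envelope calculation shows why value size matters. The sketch below assumes, purely for illustration, that each value costs exactly one read and that the value sizes are 4 KiB and 64 MiB; real systems batch, cache and split I/O in ways this ignores.

```python
def reads_needed(total_bytes: int, value_bytes: int) -> int:
    """Reads required to fetch total_bytes, assuming one read per value."""
    return -(-total_bytes // value_bytes)  # ceiling division


GIB = 1024 ** 3

small_values = reads_needed(GIB, 4 * 1024)        # hypothetical 4 KiB values
large_values = reads_needed(GIB, 64 * 1024 ** 2)  # hypothetical 64 MiB values

print(small_values)  # 262144
print(large_values)  # 16
```

Under these assumptions, fetching the same 1 GiB takes 16 large streaming reads or more than a quarter of a million small ones, which is why small values push the workload toward being IOPS-bound.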
You likely won't find Key-Value databases with links to analysis tools. The focus is on storing and retrieving data at an extremely high rate, forcing you to write the code that processes the data. Examples of key-value databases that are used in Big Data are:
- Couchbase Server
- Berkeley DB
- Amazon DynamoDB
Document Stores are very similar to the Key-Value Stores we previously discussed, but in this case the database typically also understands the value that is returned from the key-value pair, because the value is structured in a way the database can interpret. This means that instead of just having a data retrieval system where you must write custom code to use the returned data, you can typically query the data in addition to just using the key to retrieve a blob. However, this capability comes at the expense of some performance, and the data typically must be structured in some manner (i.e., a defined format that the database understands).
Be careful about the title "Document Store" because Document Store does not mean that the value is a .pdf file or a .docx file, although that is possible. In general, a Document refers to the value in the key-value pair, and it is assumed that the document encapsulates and encodes the data in a standard format or encoding. Examples of formats include XML, JSON, YAML and BSON, but you can also use binary formats such as .pdf, .docx, or .xls. Moreover, you can actually use a Document Store as you would a key-value database if the data format is unknown to the database itself (perhaps because it is a proprietary data format). If you like, think of the data in the document as a row in a traditional database. The only significant difference is that it is more flexible than a row because it does not have to conform to the row schema.
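The difference from a plain key-value store can be sketched in a few lines. The `DocumentStore` class below is hypothetical and in-memory only; it stores each value as JSON text, which it can parse, so it is able to answer a query on a field rather than only a lookup by key. Note that the second document carries an extra field, which a fixed row schema would not allow.

```python
import json


class DocumentStore:
    """Toy document store: values are JSON documents the store can parse."""

    def __init__(self):
        self._docs = {}  # key -> JSON text

    def put(self, key: str, document: dict) -> None:
        self._docs[key] = json.dumps(document)

    def get(self, key: str) -> dict:
        return json.loads(self._docs[key])

    def find(self, **criteria) -> list:
        """Return the keys of documents whose fields match every criterion."""
        matches = []
        for key, raw in self._docs.items():
            doc = json.loads(raw)
            if all(doc.get(field) == wanted for field, wanted in criteria.items()):
                matches.append(key)
        return matches


store = DocumentStore()
store.put("u1", {"name": "Ada", "city": "London"})
store.put("u2", {"name": "Linus", "city": "Helsinki", "languages": ["C"]})
store.put("u3", {"name": "Alan", "city": "London"})

print(store.find(city="London"))  # ['u1', 'u3']
print(store.get("u2")["languages"])  # ['C']
```

The linear scan in `find` is only for clarity; it also hints at the performance cost mentioned above, since the store now has to parse and inspect values instead of simply returning them.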
Document Stores can be fairly easy on storage systems because a document can be read in one large read, emphasizing the streaming nature of the data. On the other hand, if the documents are small, then the storage system starts to see data access patterns that are rather random and IOPS-driven.
Examples of Document Stores for Big Data include:
So far, the Big Data database tools have been all about performance with some basic relations between data (or in the case of Key-Value, no explicit relationships). Graph Databases go in the opposite direction and emphasize relationships among the data before all other aspects. A graph database uses concepts from a graph structure, such as nodes, edges and properties, to represent and store data. The cool thing about graph databases is that there is no real "lookup," per se. Rather, an element contains pointers to adjacent elements. So you start with a node, query it or its properties, and then explore its relationships to other nodes by following the edges.
In graph databases, the nodes contain information you want to keep track of, and the properties are pertinent information that relates to the node. Edges are lines that connect nodes to nodes or nodes to properties, and they represent the relationship between the two. These concepts can wreak havoc on storage systems. How do you access data that has connections throughout? Depending on the amount of data stored in the nodes and properties and how the edges connect the various nodes (and how much data is contained in the edges), the storage system can again see data access patterns that are somewhat random and very IOPS-driven. But there are some situations where the data access patterns are fairly sequential when certain searches are performed (e.g., simple tracing of one node to another).
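The node, edge and property vocabulary can be sketched with a small adjacency-list structure. Everything here is an illustrative assumption rather than how any real graph database is implemented: the `Graph` class, the node names and the relationship labels are hypothetical, and properties are simplified to a dictionary attached to each node instead of being linked by their own edges.

```python
from collections import deque


class Graph:
    """Toy property graph: each node keeps direct references to its neighbors."""

    def __init__(self):
        self.properties = {}  # node -> dict of properties
        self.edges = {}       # node -> list of (relationship, target node)

    def add_node(self, node, **props):
        self.properties[node] = props
        self.edges.setdefault(node, [])

    def add_edge(self, source, relationship, target):
        self.edges[source].append((relationship, target))

    def neighbors(self, node, relationship=None):
        # No index lookup: just follow the edges stored with the node.
        return [t for r, t in self.edges[node]
                if relationship is None or r == relationship]

    def reachable(self, start, relationship=None):
        """Breadth-first walk from start, returning nodes in visit order."""
        seen = {start}
        order = []
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.neighbors(node, relationship):
                if nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


g = Graph()
g.add_node("alice", age=34)
g.add_node("bob", age=29)
g.add_node("carol", age=41)
g.add_edge("alice", "knows", "bob")
g.add_edge("bob", "knows", "carol")
g.add_edge("carol", "knows", "alice")
g.add_edge("alice", "manages", "carol")

print(g.neighbors("alice", "knows"))  # ['bob']
print(g.reachable("alice", "knows"))  # ['bob', 'carol']
```

Each hop in `reachable` jumps to wherever the next node happens to live, which is the in-memory analogue of the random, IOPS-driven access pattern described above.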
Graph databases are becoming popular in Big Data because they can be used to gain more information (insight) into the data because of the edges. Examples of graph databases are:
Object databases are exactly what the phrase implies -- a combination of a database with objects where objects are as used in object-oriented programming. Because of this combination, they belong to a distinct group of databases. Typically these databases integrate the database capability with object-oriented programming, so the database is usually tightly coupled with the object language, which can limit general applicability. Examples of Object databases for Big Data are the following:
Multi-model databases are so named because they can provide multiple techniques depending on what you want to achieve. For example, the database could be a Document Store or a Graph Database or both. Examples of these databases are:
In the next article in this series, I'll present how these classes of databases can use Hadoop for data storage. I'll also discuss how these databases, while very important, are not the be-all and end-all of Big Data. You need tools to analyze the data to create information (and sometimes knowledge).
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.