Download the authoritative guide: Enterprise Data Storage 2018: Optimizing Your Storage Infrastructure
In May, Henry kicked off a collaborative effort to examine some of the details behind the Big Data push and what they really mean. This article is the third in our multipart series and the second of three that examine Big Data at a high level from the top of the stack -- that is, the applications.
Henry and I have undertaken the task of examining Big Data and what it really means. It's a buzzword that, like many buzzwords, has been beaten to death, yet contains a kernel of real usefulness and technology. We've decided to tackle Big Data, throw it into the turnbuckle, and find the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition for Big Data.
> Big Data is the process of changing data into information, which then changes into knowledge.
This definition is so appropriate because the adjective "Big" can mean many things to many fields of interest. So rather than focus on what some people may think is "Big" for their particular field, we can instead focus on what you do with the data and why.
Henry and I have chosen to tackle the discussion from two different directions. Henry started at the very bottom with the hardware itself and is moving up through the stack -- more precisely, which aspects of hardware and which technologies are important for Big Data. I'm starting at the top of the Big Data stack with the applications and moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
Starting at the top isn't easy, and my original article became rather lengthy, so we broke it into three parts. The first part discusses some fundamental issues at the top of the stack, including the importance of getting the data into a storage system for use. This step is actually more important than most people realize. The article also discusses the most common tools for Big Data -- NoSQL databases. At the end of the first article, these tools were broken into eight groups.
- Wide Column Store/Column Families
- Document Store
- Key Value/Tuple Store
- Graph Databases
- Multimodel Databases
- Object Databases
- Multivalue Databases
- RDF Databases
In this article, the second part to my top-level series, I dive into each of these eight classes, with an eye toward what it means for data storage.
Wide Column Store/Column Families
Some people refer to Wide Column Stores or Column Families as "BigTable clones." In many ways, BigTable is the grandfather (or grandmother) of Big Data NoSQL databases. It is a proprietary application that Google built on top of its other technologies, such as MapReduce, the Google File System, the Chubby lock service, SSTables and some other bits.
The design of BigTable is fairly simple. It maps two arbitrary string values (a row key and a column key) and a timestamp to an associated arbitrary byte array. Think of it as a sparse, distributed, multi-dimensional sorted map, if that phrase makes things easier for you. It is designed to scale into the petabyte range across hundreds or thousands of servers, with the ability to easily add servers to scale resources (compute and storage). Consequently, the design of BigTable focuses on performance and scalability at the expense of other aspects.
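To make that data model concrete, here is a minimal in-memory sketch of the idea of a map from (row key, column key, timestamp) to a byte array. The class name, method names and sample keys are all illustrative assumptions, not Google's actual API.

```python
# A tiny sketch of BigTable's data model: a sparse map from
# (row key, column key, timestamp) to an arbitrary byte string.
# Names here are illustrative only, not a real BigTable client.

class TinyBigTable:
    def __init__(self):
        # Sparse storage: only cells that exist take up space.
        self._cells = {}  # (row, column, timestamp) -> bytes

    def put(self, row: str, column: str, timestamp: int, value: bytes):
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str):
        """Return the value with the latest timestamp for (row, column)."""
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions)[1]  # highest timestamp wins

table = TinyBigTable()
table.put("com.example/page1", "contents:html", 1, b"<html>v1</html>")
table.put("com.example/page1", "contents:html", 2, b"<html>v2</html>")
print(table.get("com.example/page1", "contents:html"))  # b'<html>v2</html>'
```

Note that cells are versioned by timestamp, so updates accumulate history rather than overwrite in place; the real system also shards this map across many servers, which this single-process sketch ignores.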
Some of you may ask why these databases are column oriented instead of the usual row orientation. Great question! A classic row-oriented database stores each record as a row. In essence, a CSV version of the data looks like the following:
1,Johnny,68
2,Lisa,59
3,Rick,78
A column-oriented database transposes the data orientation, so that it looks like the following:
1,2,3
Johnny,Lisa,Rick
68,59,78
You can see how the data has been transformed into a column-oriented database.
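The transposition above is easy to sketch in code. The snippet below uses the same sample data; holding the columns as plain Python tuples is just a simplification for illustration.

```python
# Row-oriented layout: one tuple per record.
rows = [(1, "Johnny", 68), (2, "Lisa", 59), (3, "Rick", 78)]

# A column store keeps each column's values together instead.
# zip(*rows) transposes the rows into columns.
columns = list(zip(*rows))
print(columns[0])  # (1, 2, 3)
print(columns[1])  # ('Johnny', 'Lisa', 'Rick')
print(columns[2])  # (68, 59, 78)
```

The data is identical either way; only the physical layout changes, which is exactly what drives the performance differences discussed next.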
The benefit of column-oriented databases comes down to performance -- specifically, how well their data access patterns match the hardware (hence, Henry's bottom-up approach in this series). From the Wikipedia article about column-oriented databases, the general benefits are:
- "Column-oriented databases are typically more efficient when an aggregate function needs to be computed over many rows but only for a subset of all of the columns of data because reading that smaller subset of data can be faster than reading all data"
- "Column-oriented databases are more efficient when new values of a column are supplied for all rows at once because that column data can be written efficiently and replace old column data without touching any other columns for the rows."
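Both quoted benefits can be seen in a small sketch, assuming an in-memory column store held as a plain dict of column name to list of values (an illustrative simplification, not any particular database's API).

```python
# A sketch of both benefits above, using a dict of
# column-name -> list-of-values as a toy column store.
store = {
    "id":    [1, 2, 3],
    "name":  ["Johnny", "Lisa", "Rick"],
    "score": [68, 59, 78],
}

# Benefit 1: an aggregate over one column reads only that
# column's data; "id" and "name" are never touched.
average_score = sum(store["score"]) / len(store["score"])

# Benefit 2: supplying new values for a whole column replaces
# that column efficiently without rewriting the other columns.
store["score"] = [70, 61, 80]

print(average_score)   # mean of the original score column
print(store["score"])  # [70, 61, 80]
```

In a row store, the same aggregate would have to scan every full row, and the bulk column update would rewrite part of every record.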
For column stores, the typical workload is a fairly small number of very complex queries over similar data, which is how some Big Data workloads tend to behave. Examples include data warehouses, customer relationship management (CRM) systems and library card catalogs. However, column-oriented databases are not perfect, and there are some trade-offs.
There are several examples of column-oriented databases or data stores: