Everyone is talking about Big Data, and this popular technology has implications up and down the storage stack. This new series will look at Big Data from each end of the spectrum -- from data extraction to the hardware needed to analyze the data, with OS, file systems and other system software in the middle.
The term "Big Data" is a too-often used buzzword to mean many things in many different industries. In business analytics, Big Data often means using the information that the business received from customers, sales forecast, suppliers and many other inputs to make optimal decisions about the direction for the business, both short and long term. Commodities traders might use Big Data in a completely different way -- perhaps they would seek analytics from climate data, which requires looking at satellite and other image data along with textual information -- to decide what trades to make long or short. These examples require a completely different set of analysis tools, and to be done efficiently, a completely different type of computational and storage environment, as the algorithms that process the data to turn it into information are so different.
Recently, Jeff Layton and I met over dinner and discussed a number of different types of algorithms from graph analysis, to MapReduce, to image change detection, and others, along with frameworks such as NOSQL and the architectures that run these algorithms most efficiently. Of course, there are specialized appliances from many vendors and many more are on the way. So what Big Data means to me is the process of changing data into information, which then changes into knowledge.
This is not a new phenomenon. My favorite quote, from about 400 years ago, per Sir Francis Bacon, is "knowledge is power." As we move to extract more and more information and knowledge from our data, Jeff and I believe there is going to be a major change in the way systems are architected. You will no longer have things like static archives without the information from the archives being extracted and kept separately.
Jeff and I discussed how to approach the problem and what types of things would be important to use as well as important as you move into this new age of computing. During dinner, we came up with the idea of approaching the Big Data problem from two different directions, top down and bottom up. Jeff and I discussed writing about the data itself and what can be extracted from what types of data and, at the other end of the spectrum, the hardware needed to analyze the data. Of course, we would meet in the middle to discuss the operating system, file systems and other system software needed in a "Big Data" architecture. With the approval of the editorial staff, Jeff and I are ready to embark on "Jeff and Henry's Big Data Adventure."
I will be starting with a discussion of the hardware needed for Big Data algorithms and issues for Big Data architectures. For example:
Maybe you want some users to see some things and other users to see only anonymized data. Hospital patient records are a prime example of this; you might not want anyone but the doctor to see the actual patient record, but the research team might need to see the disease and treatment options and results. Security is going to be a huge issue, as information is created and consolidated in a single location. This will be a honey pot for hackers, regardless of whether it is personal privacy data or corporate secrets. Not everyone should be able to see everything, and everything will need to be tracked such that there is an audit trail.
Questions to ask here include:
It is a good thing that the waitress was slow and the dinner was good, or Jeff and I would not have had enough time to talk about all of these issues. Of course, we did not come to any conclusions. Since Jeff and I were at a conference, we talked about the issues during the next few days and decided that "Big Data" would be the topic for our second annual joint writing project.
For the next few months, I will work up the stack and address the Big Data problem, starting at the hardware and moving up the stack. As I have said time and time again, the minutia matters (at least some of the time). Jeff will start at the other end and work to the middle of the stack. We will meet somewhere in the operating system or compilers and libraries.
You might be asking why a storage site is talking about compilers, debuggers and the like, and why would I want to read about this stuff? Good question -- the answer is that we are going to be seeing a shift in our world from data-oriented processing to information-oriented processing. Everything is going to change, and we do not want our readers to go the way of the dinosaur. We believe that with this paradigm shift, it is critical to understand that major changes in how things are considered is starting to happen. Storage is part of the equation, but to be successful you will need to have a good understanding of not just storage but also the new operating environment and its requirements.
This is not to say that we believe you must be an expert in everything listed above, because no one person can be, or even should try to be, but it does mean that to be successful, you must be aware and understand how each of the items listed above, and likely some other things that I have not thought about or have not be created yet, fit into the framework of the future. Big Data is not about storage in the cloud. Nor is it about archive vs. backup or other tactical issues. It is about taking what you have and extracting the information to help your organization be more successful.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.