The term “Big Data” is an overused buzzword that means many things in many different industries. In business analytics, Big Data often means using the information that the business receives from customers, sales forecasts, suppliers and many other inputs to make optimal decisions about the direction of the business, both short and long term. Commodities traders might use Big Data in a completely different way. Perhaps they would seek analytics from climate data, which requires looking at satellite and other image data along with textual information, to decide what trades to make, long or short. These examples require completely different sets of analysis tools and, to be done efficiently, completely different types of computational and storage environments, because the algorithms that turn the data into information are so different.
Recently, Jeff Layton and I met over dinner and discussed a number of different types of algorithms, from graph analysis to MapReduce to image change detection, along with frameworks such as NoSQL and the architectures that run these algorithms most efficiently. Of course, there are specialized appliances from many vendors, and many more are on the way. So what Big Data means to me is the process of changing data into information, which then changes into knowledge.
This is not a new phenomenon. My favorite quote, from about 400 years ago, per Sir Francis Bacon, is “knowledge is power.” As we move to extract more and more information and knowledge from our data, Jeff and I believe there is going to be a major change in the way systems are architected. You will no longer have things like static archives without the information from the archives being extracted and kept separately.
Jeff and I discussed how to approach the problem and what types of things will be important to understand and use as you move into this new age of computing. During dinner, we came up with the idea of approaching the Big Data problem from two different directions, top down and bottom up. Jeff and I discussed writing about the data itself and what can be extracted from what types of data and, at the other end of the spectrum, the hardware needed to analyze the data. Of course, we would meet in the middle to discuss the operating system, file systems and other system software needed in a “Big Data” architecture. With the approval of the editorial staff, Jeff and I are ready to embark on “Jeff and Henry’s Big Data Adventure.”
I will be starting with a discussion of the hardware needed for Big Data algorithms and issues for Big Data architectures. For example:
- What kind of architecture is needed to solve the MapReduce problems of the future, the graph problems of the future or the image change detection?
- Do you need SSDs, SAS drives or enterprise SATA drives?
- What type of storage controllers are needed?
- What are the critical data archiving issues?
- In the future, what types of interfaces are needed: SAS, Fibre Channel, Ethernet or something else?
- Will the planned CPUs meet the requirements, or do you need GPGPUs, FPGAs or something less obvious?
- What about memory requirements? Are the future DDR-3/4/5 memory plans going to meet the needs?
- Do you need memory hierarchies and larger memory footprints, such as machines connected via extending the CPU channels like the SGI Ultraviolet, or specialized memory systems and processors, such as the Cray uRIKA?
- Do CPUs built with cache coherency checking and cache coherency bandwidth make sense for the types of data analytics you need?
- Are operating systems up to the task of addressing the applications and the underlying hardware?
- What about languages, compilers, debuggers and the whole ecosystem needed to run the system hardware?
- Don’t forget about the security requirements. Now that data has become information and created knowledge, how do you keep that information from your competitors, your enemies, and any employees who should not have access to the knowledge?
Maybe you want some users to see some things and other users to see only anonymized data. Hospital patient records are a prime example of this; you might not want anyone but the doctor to see the actual patient record, but the research team might need to see the disease and treatment options and results. Security is going to be a huge issue, as information is created and consolidated in a single location. This will be a honey pot for hackers, regardless of whether it is personal privacy data or corporate secrets. Not everyone should be able to see everything, and everything will need to be tracked such that there is an audit trail.
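The idea of per-role views plus an audit trail can be sketched in a few lines. This is only an illustration under stated assumptions: the roles, field names, and the `RecordStore` class are hypothetical, and simply dropping a name field is not sufficient anonymization for real patient data.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical roles and field names, chosen only for illustration.
FIELDS_BY_ROLE = {
    "doctor": {"patient_name", "disease", "treatment", "outcome"},
    "researcher": {"disease", "treatment", "outcome"},
}


@dataclass
class RecordStore:
    """Toy in-memory store that filters fields by role and logs every read."""
    records: dict
    audit_log: list = field(default_factory=list)

    def read(self, user, role, record_id):
        # Unknown roles get an empty set of allowed fields.
        allowed = FIELDS_BY_ROLE.get(role, set())
        record = self.records[record_id]
        view = {k: v for k, v in record.items() if k in allowed}
        # Log every access, including those that return nothing.
        self.audit_log.append({
            "when": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "role": role,
            "record_id": record_id,
            "fields": sorted(view),
        })
        return view


store = RecordStore({
    "r1": {
        "patient_name": "Example Patient",
        "disease": "influenza",
        "treatment": "rest and fluids",
        "outcome": "recovered",
    }
})

doctor_view = store.read("dr_a", "doctor", "r1")
researcher_view = store.read("analyst_b", "researcher", "r1")
unknown_view = store.read("visitor_c", "intern", "r1")

print(doctor_view)
print(researcher_view)
print(unknown_view)
print(len(store.audit_log), "accesses logged")
```

The design choice worth noting is that filtering and logging happen in the same call, so no read path can return data without leaving a record of who asked and what they were shown.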
Questions to ask here include:
- What about the applications that will need to run on these systems?
- Will some queries have a higher priority than others?
- How will the applications write data so that it can be easily read for processing?
- How many threads are needed for applications, and will a parallel programming model be needed? If so, which one? Or will an SMP model be needed instead?
- Can the application take any shortcuts on processing to get, say, a 90 percent answer with 50 percent of the computational processing? Is 90 percent good enough in the time frame in which the answer is needed, or are you making a life-and-death decision, in which case 90 percent is not good enough?
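One common way to trade accuracy for computation, as the last question suggests, is to process only a sample of the data. The sketch below is a hypothetical example, not a claim about any particular workload: it estimates a mean from a random 50 percent sample of synthetic data and reports how far the estimate is from the exact answer. The function names, the data, and the fixed seed are all assumptions made for illustration.

```python
import random


def full_mean(values):
    """Exact answer: touches every value."""
    return sum(values) / len(values)


def sampled_mean(values, fraction, seed=0):
    """Approximate answer: touches only a random fraction of the values."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    k = max(1, int(len(values) * fraction))
    sample = rng.sample(values, k)
    return sum(sample) / k


# Synthetic data standing in for a real data set.
values = [i % 100 for i in range(10_000)]

exact = full_mean(values)
approx = sampled_mean(values, fraction=0.5)
relative_error = abs(approx - exact) / exact

print(f"exact={exact:.3f} approx={approx:.3f} relative_error={relative_error:.4%}")
```

Whether the resulting error is acceptable depends entirely on the decision being made; for skewed data or rare events, a simple random sample like this can be far less accurate than it is on this well-behaved example.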
It is a good thing that the waitress was slow and the dinner was good, or Jeff and I would not have had enough time to talk about all of these issues. Of course, we did not come to any conclusions. Since Jeff and I were at a conference, we talked about the issues during the next few days and decided that “Big Data” would be the topic for our second annual joint writing project.
How We Will Tackle Big Data
For the next few months, I will address the Big Data problem by starting at the hardware and working up the stack. As I have said time and time again, the minutiae matter (at least some of the time). Jeff will start at the other end and work toward the middle of the stack. We will meet somewhere in the operating system or compilers and libraries.
You might be asking why a storage site is talking about compilers, debuggers and the like, and why you would want to read about this stuff. Good question. The answer is that we are going to see a shift in our world from data-oriented processing to information-oriented processing. Everything is going to change, and we do not want our readers to go the way of the dinosaur. We believe that with this paradigm shift, it is critical to understand that major changes in how systems are designed and used are starting to happen. Storage is part of the equation, but to be successful you will need a good understanding of not just storage but also the new operating environment and its requirements.
This is not to say that we believe you must be an expert in everything listed above, because no one person can be, or even should try to be. It does mean that to be successful, you must be aware of and understand how each of the items listed above, and likely some other things that I have not thought about or that have not been created yet, fits into the framework of the future. Big Data is not about storage in the cloud. Nor is it about archive vs. backup or other tactical issues. It is about taking what you have and extracting the information to help your organization be more successful.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.
Jeff Layton is the Enterprise Technologist for HPC at Dell, Inc., and a regular writer of all things HPC and storage.