Top of the Big Data Stack: The Importance of the Software Stack!
Instead of a database running on a single server or a small cluster of servers, you now have NoSQL databases running across distributed systems that can run into the hundreds of nodes. Moreover, you also have Big Data systems that grow by adding more nodes with storage or by adding more storage to existing nodes. Plus, some day, you will have databases running across data centers that are not even on the same continent.
All of these systems need the most current and correct set of device drivers to function properly, but the underlying hardware within each system may differ even if it comes from the same manufacturer. If you have been a system administrator, you know what happens when you add a new piece of hardware to the system - things immediately break. So how do you manage, monitor, and administer these possibly heterogeneous systems?
The HPC community has been working on this problem for several decades now. There are some robust tool sets that allow systems to be remotely deployed, managed, and monitored. There is no secret "admin voodoo" that you need to make this all happen. You do need good management and organizational skills and an eye for detail. You may have 16 nodes from the same vendor, but 15 of them have identical network cards and the 16th node has a slightly different version. You have to catalog and document this difference and make sure the tools you’re using can handle this card correctly.
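To make the 16-node example concrete, here is a minimal sketch of what "catalog and document this difference" might look like in practice. It is an illustration only: the node names, NIC model strings, and driver versions are made up, and the inventory is hard-coded, whereas a real one would come from whatever deployment or monitoring tool you use.

```python
from collections import Counter

# Hypothetical inventory: node name -> (NIC model, driver version).
# All names and version strings here are invented for illustration.
inventory = {f"node{i:02d}": ("nic-model-a", "1.2.0") for i in range(1, 16)}
inventory["node16"] = ("nic-model-a-rev2", "1.2.0")


def find_outliers(inventory):
    """Return the most common hardware signature and the nodes that differ from it."""
    counts = Counter(inventory.values())
    baseline, _ = counts.most_common(1)[0]
    outliers = {node: sig for node, sig in inventory.items() if sig != baseline}
    return baseline, outliers


baseline, outliers = find_outliers(inventory)
print("baseline:", baseline)
for node, sig in sorted(outliers.items()):
    print(f"{node} differs: {sig}")
```

The point is less the code than the habit: once the differences are recorded in one place, you can check them automatically every time a node is added or replaced.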
While I didn't really talk about the details of device drivers (these discussions happen in dark mysterious places), the impact of device drivers on a Big Data system as a whole can be large. I don't believe the Big Data world has really thought this problem through, and it is writing its own new set of tools for deploying, managing, and monitoring multiple systems that support a single application. My advice is to talk to the HPC folks and ask them how they do it. I think you will be surprised at what they have been able to accomplish.
Big Data is all about taking data, creating information from it, and turning that information into knowledge. Big Data applications take data from various sources and run user applications in the hope of producing this information (knowledge usually comes later).
But, as the term implies, Big Data can involve a great deal of data. Consequently, Big Data developers need to start thinking about how they will scale their applications as the amount of data grows. That means you have to consider parallelism in the Big Data stack.
While many people are focused on the application aspect of Big Data, the underpinnings such as the programming languages, data access techniques, the OS, and management tools can all have a profound impact on the overall performance. As the title says, "no one expects the software stack".