Top of the Big Data Stack: The Importance of the Software Stack
Henry Newman and I have undertaken the task of examining Big Data and what it really means. It's a buzzword. Like many buzzwords, it has been beaten to death yet contains a kernel of real usefulness, technology, and ideas.
We've decided to tackle Big Data by doing some “ground and pound” on the subject and finding the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition of Big Data I've seen. Hence, I will repeat it, yet again:
Big Data is the process of changing data into information, which then changes into knowledge.
Henry and I chose to tackle the discussion by coming from two different directions. Henry is starting at the very bottom of the stack with the hardware itself and then moving up through the stack. More precisely, what aspects of hardware are important for Big Data and what technologies are important?
I'm starting at the top of the Big Data stack with the applications and then moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
I'm now ready to go deeper, underneath the application layer. Typically people are most worried about the application itself and less about the software underpinnings. But as you can probably tell, these software underpinnings can be a blessing, a curse, or even both. To better understand this, let's drill down, starting with the application languages and going down to the drivers (I'll leave the device firmware discussion to Henry - he loves that stuff).
Today the Big Data world is running what are mostly serial applications on local storage (Hadoop) and are using task parallelism from things such as mapreduce to improve performance. But at some point this approach will hit a hard limit and quit scaling at that point. New ideas will have to be used to improve performance.
Thing such as application parallelism and parallel IO will have to be used to improve performance. But these ideas are for naught if the application languages don't support them.
The application itself is written in some sort of language. In the Big Data world, Java is the dominant programming language.
Recall that a large portion of Big Data is focused on performance and a great deal of this performance comes from task parallelism. But task paralleism can only carry performance so far before you are limited to a few serial jobs running in parallel. Moreover, Big Data applications are growing in size and complexity all of the time. The coupling of growth with finite task parallelism can easily be a boat anchor to performance, requiring developers to start thinking about how to run specific tasks in parallel.
Consequently, paying attention to the parallel aspects of the application language is critical to performance. For example, does the language have any parallel aspects or does it rely solely on add-on libraries such as MPI (Message Passing Interface)? Is the parallelism easy to express or does it require the proverbial patting of one's head while rubbing your stomach?
An example of a programming language that might be useful for Big Data is X10, a language developed by IBM as part of the DARPA project on Productive, Easy-to-Use, Reliable Compute System (PERC).
A core part of the language is around easy to write parallelism. X10 achieves underlying parallelism through something called PGAS or Partitioned Global Address Space. PGAS languages use a global name space (no replicated data requirements), but the space can be logically partitioned with a portion of the space local to each processor. This is extended to include the idea that portions of the address space may have an affinity for a specific thread -- meaning that these threads can then be run locally.
In the case of X10, the computation is divided among a set of places where each "place" contains some data and hosts one or more "activities" that operate on the data. In essence you can think of a "place" as PGAS memories and an "activity" as a thread.