Henry Newman and I have undertaken the task of examining Big Data and what it really means. It's a buzzword. Like many buzzwords, it has been beaten to death yet contains a kernel of real usefulness, technology, and ideas.
We've decided to tackle Big Data by doing some “ground and pound” on the subject and finding the kernels of truth and what they mean for storage solutions.
Henry kicked off the series with a great introduction, including what I consider to be the best definition of Big Data I've seen. Hence, I will repeat it, yet again:
Big Data is the process of changing data into information, which then changes into knowledge.https://o1.qnsr.com/log/p.gif?;n=203;c=204660761;s=10655;x=7936;f=201812281257540;u=j;z=TIMESTAMP;a=20400368;e=iHenry and I chose to tackle the discussion by coming from two different directions. Henry is starting at the very bottom of the stack with the hardware itself and then moving up through the stack. More precisely, what aspects of hardware are important for Big Data and what technologies are important?
I'm starting at the top of the Big Data stack with the applications and then moving down through the stack. We'll meet somewhere in the middle and collect our ideas and comments into a final article.
I'm now ready to go deeper, underneath the application layer. Typically people are most worried about the application itself and less about the software underpinnings. But as you can probably tell, these software underpinnings can be a blessing, a curse, or even both. To better understand this, let's drill down, starting with the application languages and going down to the drivers (I'll leave the device firmware discussion to Henry - he loves that stuff).
Today the Big Data world is running what are mostly serial applications on local storage (Hadoop) and are using task parallelism from things such as mapreduce to improve performance. But at some point this approach will hit a hard limit and quit scaling at that point. New ideas will have to be used to improve performance.
Thing such as application parallelism and parallel IO will have to be used to improve performance. But these ideas are for naught if the application languages don't support them.
The application itself is written in some sort of language. In the Big Data world, Java is the dominant programming language.
Recall that a large portion of Big Data is focused on performance and a great deal of this performance comes from task parallelism. But task paralleism can only carry performance so far before you are limited to a few serial jobs running in parallel. Moreover, Big Data applications are growing in size and complexity all of the time. The coupling of growth with finite task parallelism can easily be a boat anchor to performance, requiring developers to start thinking about how to run specific tasks in parallel.
Consequently, paying attention to the parallel aspects of the application language is critical to performance. For example, does the language have any parallel aspects or does it rely solely on add-on libraries such as MPI (Message Passing Interface)? Is the parallelism easy to express or does it require the proverbial patting of one's head while rubbing your stomach?
An example of a programming language that might be useful for Big Data is X10, a language developed by IBM as part of the DARPA project on Productive, Easy-to-Use, Reliable Compute System (PERC).
A core part of the language is around easy to write parallelism. X10 achieves underlying parallelism through something called PGAS or Partitioned Global Address Space. PGAS languages use a global name space (no replicated data requirements), but the space can be logically partitioned with a portion of the space local to each processor. This is extended to include the idea that portions of the address space may have an affinity for a specific thread -- meaning that these threads can then be run locally.
In the case of X10, the computation is divided among a set of places where each "place" contains some data and hosts one or more "activities" that operate on the data. In essence you can think of a "place" as PGAS memories and an "activity" as a thread.
In keeping with the concept of a global space in PGAS implementations, X10 also has the concept of globally distributed arrays along with structured and unstructured arrays.
With X10 you also have fine-grained concurrency with the ability to use large distributed clusters that can be heterogeneous (recall that Big Data is designed for distributed computation). Perhaps even more important, it is interoperable with Java, which is the lingua franca of Big Data, with the exception of the analytics portion, which is dominated by R.
X10 has gone beyond classic Java to add a real parallel execution model as well as data sharing but at the same time, X10 understands the idea of a distributed system. As Big Data continues to grow, the importance of using languages such as X10 will becomes increasingly necessary.
For larger and more diverse data sets, you can no longer rely on a "local" language that executes on a single node. You will need a language that is designed for parallelism. While I'm not pushing X10, it does have an advantage in that it can interoperate with Java. So it should be relatively easy to write applications that interact with existing NoSQL and Hadoop applications.
There are other efforts at parallelizing applications using MPI. For example SAS is already using MPI (Message Passing Interface) in some of the applications to improve performance.
Even if we use a language such as X10 for writing parallel Big Data applications, at some point it will be become painfully obvious that relying on what are essentially "local" file systems will also become an impediment to scaling performance.
The software issues do not stop with just the application language because Big Data is really about – you guessed it – data.
Right now a majority of the Big Data world really does everything in serial data access patterns. As an example, let's consider an analytics application written in R. The application is basically single threaded since R is primarily single-threaded but it uses a NoSQL database to access the data (note that the database itself can easily be distributed).
The database then accesses the data that is stored in Hadoop. Using mapreduce for task parallelism, the database may access different sets of data or even the same data, on difference nodes.
However, remember that the Hadoop/mapreduce model really allows the application to access the data only on the node where the data is located. So you are locked into the data access performance of a single node (i.e. "local" data access). At some point the Big Data world is going to discover that task parallelism only gets them so far and that they will have to start thinking about parallel data access.
Recall that when a job runs within a Hadoop environment it is assigned to a node where the data is located. The job(s) are started on that node and all data access is done on that node alone (again, "local" access).
Each node in the system has some sort of direct attached storage (DAS) that is now the bottleneck for data access performance. A single data request from the application goes to the Hadoop file system, which makes a serial data request to the underlying storage, which uses RAID across several disks to get better performance.
The point is that the data access is limited by the performance of the local storage system that is attached to that node. The only way to temporarily push the bottleneck somewhere else is to start throwing lots of storage hardware at each node, which will get expensive very quickly.
To avoid spending too much on hardware just to gain some temporary storage enhancement, what is needed is for the underlying file system and the API for accessing that file system to allow parallel data access. A classic example of this is from the HPC (High Performance Computing) field in the form of MPI-IO.
To quote the link, "The purpose of MPI/IO is to provide high performance, portable, parallel I/O interfaces to high performance, portable, parallel MPI programs." If you strip away the plethora of adjectives, what MPI/IO provides is a set of functions for MPI programs that do parallel IO.
This means that either a serial or parallel application (MPI does not restrict you to only parallel applications) can perform parallel IO to a single file. The critical design point for MPI/IO is to provide a high-speed interface for program's checkpoint in the event of a system failure or to output data for post processing. This is significantly different than what might be needed for a Big Data algorithm.
In these file systems the data is striped or otherwise distributed across a number of storage nodes. So when data is accessed it is possible that all of the data servers will locally access their portion of the data, assemble the data in the proper order (there are many ways to do this), and the resulting stream of data is returned to the application. This allows the data to be accessed in parallel even if the data request is serial. This file system model is very different from Hadoop. Either Hadoop will need to be adapted or something new will replace it.
Today we have Hadoop and the applications that use it, accessing data very serially. Let's assign that a performance of 1. If had parallel applications performing parallel IO then we can have n processes accessing the data. If the hardware can keep up then we have a performance of n where nis the number of processes running in parallel.
Then if there are m data servers for a parallel file system and all of them can run at full speed, then the performance for an application that is parallel and accessing data on a parallel file system is n x m. For very moderate values of n and mthe performance goes up quite dramatically.
A simple example of a file system that has 4 data servers (m=4), and an application that runs on every core in a 16-core node (n=16), means that we can theoretically get a speedup of 64 relative to what we can do today with Hadoop and serial applications. Moreover, if we keep today's task parallelism in place while adding application and file system parallelism, then we get an even larger boost in performance.
Big Data OS
Underlying the application layer, the application languages, and parallel data access, lies something equally important, the OS (Operating System). When a data request hits the OS there are a number of things that happen to it.
OS's are all different but in general, there is an IO scheduler that schedules and retires IO requests based on some algorithm. This scheduler will also try to do things such as combine neighboring data requests to make the data requests so a single read or write can satisfy all of the combined requests. This can improve throughput but it also increases latency, reducing the apparent IOPS (Input Output Operations Per Second) performance.
However, IOPS can be very important to Big Data applications. Remember that data is stored in some fashion such as key-value pairs, so that if you need a certain piece of information, the data access is likely to force a seek to somewhere else on the storage media. This increases latency and the total time to access a very small bit of data. This data access pattern can happen quite often in Big Data, to the point where random IOPS performance of the underlying storage is the bottleneck for application performance.
There are some approaches you can use to help data access patterns such as wide column store databases, but as you scale the data in the database either in size (total capacity) or amount of data (number of records), IOPS will become an increasingly important aspect of performance.
As mentioned earlier this is driven by various pieces in the OS including context switching, which Henry Newman has written about. The need for more IOPS puts tremendous pressure on the OS. Coupled with this are SSD (Solid State Drive) manufacturers claiming close to one million IOPS from a device.
How does an OS handle this many IO requests? This is going to force a reexamination of how an OS handles very large IOPS. This means that we as a community will need to push the OS writers to rethink or adapt to the needs of Big Data.
Device drivers are one of those wondrous things that just seem to happen. New hardware comes out and – bingo! – there are device drivers that allow the OS to communicate with the new hardware. But it's really a "duck" situation. That is, on the surface, the duck (device drivers) looks calm and serene and underneath the water, their legs are paddling furiously (the device driver writers).
There is new hardware coming out all the time that requires people to write device drivers or otherwise the hardware is worthless. But perhaps worse, hardware vendors love to tweak their hardware ever so slightly, which may or may not break existing device drivers, causing the driver to be modified or even re-written.
In my opinion the unsung heroes of an OS are the device driver authors. It is definitely not an easy task and requires great amount of coordination and testing.
How do device drivers impact Big Data? Besides the obvious issue of making sure you have the correct and up to date device drivers in place, in my opinion there is one key problem the Big Data community has not come to grip with - and that is the operation and administration of large distributed systems.
Instead of a database running on a single server or a small cluster of servers, you now have NoSQL databases running across distributed systems that can run into the hundreds of nodes. Moreover, you also have Big Data systems that are growing by adding more nodes with storage or by adding more storage to existing nodes. Plus, some day, you will have databases running across data centers that are not even in the same continent.
All of the systems need to have the most current and the correct set of device drivers to function properly but the underlying hardware within the system may be different even if it is from the same manufacturer. If you have been a system administrator, you know what happens when you add a new piece of hardware to the system - things immediately break. So how do you manage, monitor, and administer these possibly heterogeneous systems?
The HPC community has been working on this problem for several decades now. There are some robust tool sets that allow systems to be remotely deployed, managed, and monitored. There is no secret "admin voodoo" that you need to make this all happen. You do need good management and organizational skills with an eye for attention to detail. You may have 16 nodes from the same vendor but 15 of them have identical network cards and the 16th node has a slightly different version. You have to catalog and document this difference and make sure the tools you’re using can use this card correctly.
While I didn't really talk about the details of device drivers (these discussions happen in dark mysterious places), the impact of device drivers on a Big Data system as a whole can be large. I don't believe the Big Data world has really thought this problem and are writing their own new set of tools for deploying, managing, and monitoring multiple systems that support a single application. My advice is to talk to the HPC folks and ask them how they do it. I think you will be surprised at what that have been able to accomplish.
Big Data is all about taking data, creating information from it, and turning that information into knowledge. Big Data applications take data from various sources and run user applications in the hope of producing this information (knowledge usually comes later).
But, as the term implies, Big Data can involve a great deal of data. Consequently, Big Data developers need to start thinking about how they will scale their applications as the amount of data grows. Consequently, you have to consider parallelism in the Big Data stack.
While many people are focused on the application aspect of Big Data, the underpinnings such as the programming languages, data access techniques, the OS, and management tools can all have a profound impact on the overall performance. As the title says, "no one expects the software stack".