Top of the Big Data Stack: The Importance of the Software Stack - Page 2


Want the latest storage insights?

Download the authoritative guide: Enterprise Data Storage 2018: Optimizing Your Storage Infrastructure

Share it on Twitter  
Share it on Facebook  
Share it on Google+
Share it on Linked in  

In keeping with the concept of a global space in PGAS implementations, X10 also has the concept of globally distributed arrays along with structured and unstructured arrays.

With X10 you also have fine-grained concurrency with the ability to use large distributed clusters that can be heterogeneous (recall that Big Data is designed for distributed computation). Perhaps even more important, it is interoperable with Java, which is the lingua franca of Big Data, with the exception of the analytics portion, which is dominated by R.

X10 has gone beyond classic Java to add a real parallel execution model as well as data sharing but at the same time, X10 understands the idea of a distributed system. As Big Data continues to grow, the importance of using languages such as X10 will becomes increasingly necessary.

For larger and more diverse data sets, you can no longer rely on a "local" language that executes on a single node. You will need a language that is designed for parallelism. While I'm not pushing X10, it does have an advantage in that it can interoperate with Java. So it should be relatively easy to write applications that interact with existing NoSQL and Hadoop applications.

There are other efforts at parallelizing applications using MPI. For example SAS is already using MPI (Message Passing Interface) in some of the applications to improve performance.

Even if we use a language such as X10 for writing parallel Big Data applications, at some point it will be become painfully obvious that relying on what are essentially "local" file systems will also become an impediment to scaling performance.

Parallel IO

The software issues do not stop with just the application language because Big Data is really about – you guessed it – data.

Right now a majority of the Big Data world really does everything in serial data access patterns. As an example, let's consider an analytics application written in R. The application is basically single threaded since R is primarily single-threaded but it uses a NoSQL database to access the data (note that the database itself can easily be distributed).

The database then accesses the data that is stored in Hadoop. Using mapreduce for task parallelism, the database may access different sets of data or even the same data, on difference nodes.

However, remember that the Hadoop/mapreduce model really allows the application to access the data only on the node where the data is located. So you are locked into the data access performance of a single node (i.e. "local" data access). At some point the Big Data world is going to discover that task parallelism only gets them so far and that they will have to start thinking about parallel data access.

Recall that when a job runs within a Hadoop environment it is assigned to a node where the data is located. The job(s) are started on that node and all data access is done on that node alone (again, "local" access).

Each node in the system has some sort of direct attached storage (DAS) that is now the bottleneck for data access performance. A single data request from the application goes to the Hadoop file system, which makes a serial data request to the underlying storage, which uses RAID across several disks to get better performance.

The point is that the data access is limited by the performance of the local storage system that is attached to that node. The only way to temporarily push the bottleneck somewhere else is to start throwing lots of storage hardware at each node, which will get expensive very quickly.

To avoid spending too much on hardware just to gain some temporary storage enhancement, what is needed is for the underlying file system and the API for accessing that file system to allow parallel data access. A classic example of this is from the HPC (High Performance Computing) field in the form of MPI-IO.

To quote the link, "The purpose of MPI/IO is to provide high performance, portable, parallel I/O interfaces to high performance, portable, parallel MPI programs." If you strip away the plethora of adjectives, what MPI/IO provides is a set of functions for MPI programs that do parallel IO.

This means that either a serial or parallel application (MPI does not restrict you to only parallel applications) can perform parallel IO to a single file. The critical design point for MPI/IO is to provide a high-speed interface for program's checkpoint in the event of a system failure or to output data for post processing. This is significantly different than what might be needed for a Big Data algorithm.

In addition to applications that perform parallel IO you can also have an underlying file system that is parallel. Examples of this include GPFS from IBM, Lustre, and Panasas.

In these file systems the data is striped or otherwise distributed across a number of storage nodes. So when data is accessed it is possible that all of the data servers will locally access their portion of the data, assemble the data in the proper order (there are many ways to do this), and the resulting stream of data is returned to the application. This allows the data to be accessed in parallel even if the data request is serial. This file system model is very different from Hadoop. Either Hadoop will need to be adapted or something new will replace it.

Today we have Hadoop and the applications that use it, accessing data very serially. Let's assign that a performance of 1. If had parallel applications performing parallel IO then we can have n processes accessing the data. If the hardware can keep up then we have a performance of n where nis the number of processes running in parallel.

Then if there are m data servers for a parallel file system and all of them can run at full speed, then the performance for an application that is parallel and accessing data on a parallel file system is n x m. For very moderate values of n and mthe performance goes up quite dramatically.

A simple example of a file system that has 4 data servers (m=4), and an application that runs on every core in a 16-core node (n=16), means that we can theoretically get a speedup of 64 relative to what we can do today with Hadoop and serial applications. Moreover, if we keep today's task parallelism in place while adding application and file system parallelism, then we get an even larger boost in performance.

Submit a Comment


People are discussing this article with 0 comment(s)