Why Use MapReduce? - Page 3
At the beginning of this article I used Henry's definition of Big Data: taking data and turning it into information (as a first step). Given that Big Data usually implies lots of data (but not always), it also implies lots of processing. You are not likely to run one application against a single data set to produce some end result. Instead, you are likely to run several different analyses against multiple data sets with various parameters, collect the information (results) from each, and store those in a database for further processing. This implies a large number of runs over different, potentially large, data sets, producing lots of results. How do you coordinate and configure all of these runs?
One way to do this is to use something called MapReduce. In general terms, MapReduce is a framework for embarrassingly parallel computations that use potentially large data sets and a large number of nodes. Ideally, it also uses data that is stored locally on the particular node where the job is being executed. The computations are embarrassingly parallel because there is no communication between them; they run independently of one another.
As the name implies, MapReduce has two steps. The first step, the "Map" step, takes the input, breaks it into smaller sub-problems, and distributes them to the worker nodes. The worker nodes then send their results back to the "master" node. The second step, the "Reduce" step, takes the results from the worker nodes and combines them in some manner to create the output for the original job.
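The two steps can be sketched with the classic word-count example. This is a minimal, single-process sketch; a real framework would run the map calls on worker nodes and gather the intermediate pairs at the master before reducing:

```python
from collections import defaultdict

# Map step: break the input into sub-problems (one document per map
# call) and emit intermediate (key, value) pairs.
def map_words(document):
    return [(word, 1) for word in document.split()]

# Reduce step: combine all values that share a key into one result.
def reduce_counts(pairs):
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# The "master" gathers intermediate results from every mapper...
intermediate = [pair for doc in documents for pair in map_words(doc)]

# ...and the reduce step assembles the final answer.
result = reduce_counts(intermediate)
print(result["the"])  # 3 occurrences across all documents
```

Note that each `map_words` call touches only its own document, which is what lets the framework scatter those calls across nodes.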
As you can tell from the description, MapReduce deals with distributed processing in both steps, but remember, the processing is designed to be embarrassingly parallel. This is where MapReduce gets its performance: performing operations in parallel. Getting the most performance means there is no communication between worker nodes, so no data is really shared between them (unlike HPC applications, many of which are MPI-based and can potentially share massive amounts of data).
However, there could be situations where the mapping operations spawn other mapping operations, so there is some communication between them, and the computation is no longer quite so embarrassingly parallel. Typically, these situations don't involve much internode communication. In addition, parallelism can be limited by the number of worker nodes that have a copy of the data. If you have five nodes needing access to the same data file, but only three copies of it, two nodes will have to pull the data from a different worker node, which reduces parallelism and performance. This is true for both the Map phase and the Reduce phase. On the other hand, three copies of the data allow three applications to access the same data at once, unlike a serial application, which works from a single copy.
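The replica-limited case above can be sketched as a toy scheduler that prefers nodes holding a local copy of the data and falls back to remote reads once the replicas run out. The node names and the `assign_tasks` helper are hypothetical, purely for illustration:

```python
# Hypothetical scheduler sketch: prefer nodes that hold a local
# replica of the data; the rest must read over the network.
def assign_tasks(requesting_nodes, replica_nodes):
    local, remote = [], []
    available = set(replica_nodes)
    for node in requesting_nodes:
        if node in available:
            local.append(node)       # reads its local copy
            available.discard(node)
        else:
            remote.append(node)      # must pull data from another node
    return local, remote

# Five nodes need the same file, but only three hold a replica.
local, remote = assign_tasks(
    ["n1", "n2", "n3", "n4", "n5"], ["n1", "n3", "n5"])
print(len(local), len(remote))  # 3 local reads, 2 remote pulls
```

Only three tasks get the fast local-read path; the other two pay the network cost, which is the reduced parallelism the text describes.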
At first glance, one would think MapReduce was fairly inefficient because it must break up the problem, distribute the pieces (which may be subdivided yet again), and then assemble all of the results from the worker nodes to create the final answer. That seems like a great deal of work just to set up the problem and execute it. For small problems, this is definitely true -- it's faster to execute the application on a single node than to use MapReduce.
Where MapReduce shines is in parallel operations on large data sets that require a great deal of computational time on the worker or assembler nodes. If I haven't said it clearly enough, the "magic" of MapReduce is exploiting parallelism to improve performance.
Traditionally, databases are not necessarily designed for fault tolerance when run on a cluster. If you lose a node in the cluster, you have to stop the job, check the file system and database, restart the database on fewer nodes, and rerun the application. NoSQL databases, and most tools like them, were primarily designed for two things: 1) performance, particularly around data, and 2) fault tolerance. One way some of these tools achieve fault tolerance is to use Hadoop as the underlying file system. Another way is to make MapReduce itself fault tolerant.
Remember, MapReduce breaks problems into smaller sub-problems and so on. It then takes the output from these sub-problems and assembles it into the final answer. This uses parallelism to your advantage in running your application as quickly as possible. MapReduce usually adds fault tolerance because, if a task fails for some reason, the job scheduler can reschedule that task on another node as long as the data is still available. This means MapReduce can recover from the failure of a datanode (or several) and still complete the job.
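The rescheduling idea can be sketched as a retry loop: when a task fails on one node, the scheduler re-runs it against another replica of the same data. The node names, the `flaky_task` function, and the failure simulation are all hypothetical:

```python
# Sketch of MapReduce-style task recovery: if a task fails, re-run it
# on the next node that holds a replica of the data.
def run_with_retries(task, replica_nodes, max_attempts=3):
    for node in replica_nodes[:max_attempts]:
        try:
            return task(node)
        except RuntimeError:
            continue  # node failed; reschedule on another replica
    raise RuntimeError("no replica could complete the task")

def flaky_task(node):
    if node == "node-a":             # simulate a failed datanode
        raise RuntimeError("node down")
    return f"result from {node}"

print(run_with_retries(flaky_task, ["node-a", "node-b", "node-c"]))
# prints "result from node-b"
```

The job as a whole survives as long as at least one replica of each task's input remains reachable, which is exactly why the data copies and the scheduler work together.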
Many times people think of Hadoop as a pure file system that can be used like any normal file system. However, Hadoop was designed from the beginning to support MapReduce, and MapReduce is the fundamental way of interacting with the file system. Applications that interact with Hadoop can use an API, but Hadoop is really designed to use MapReduce as the primary method of interaction. Coupling multiple data copies with the parallelism of MapReduce produces a very scalable, distributed, and fault-tolerant solution. Just remember that the design allows nodes to fail without interrupting processing. It also means you can add datanodes to the system, and Hadoop and MapReduce will take advantage of them.