The Need for Parallel Cloud Storage - Page 3
Implications — Parallel Cloud Storage
There are two important implications of the increase in data. The first is that people are going to be doing much more analysis on different varieties of data. They won't be doing a single large analysis across all of the data. Rather, they will be running multiple analyses that ask different questions, which means the same data will be accessed from different servers.
For example, one person or team could be using video data to analyze traffic patterns in the parking lot, to better understand the traffic flow either for a redesign of the lot or to reduce insurance costs. A second person or team could be using the same video data to measure how long people spend inside the store, which is then correlated with how much they spent during that time ($/minute). Both analyses are equally important, and both use the same data.
A second implication is that these data files are going to become even larger. Will the files be too big for the memory of a single node? Will the processing have to be split across nodes? At this point, performance is likely to be driven by the data access performance of a single node (a single core in some cases).
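To make the idea of splitting processing across nodes concrete, here is a minimal sketch of dividing one large file into contiguous byte ranges, one per worker. The function name `byte_ranges` and the notion of a numbered "worker" are assumptions made for illustration; they do not come from any particular framework.

```python
from typing import List, Tuple


def byte_ranges(file_size: int, num_workers: int) -> List[Tuple[int, int]]:
    """Split [0, file_size) into num_workers contiguous (offset, length) ranges.

    Hypothetical helper: each worker (a node, process, or thread) would read
    only its own range instead of loading the whole file into memory.
    """
    if file_size < 0 or num_workers <= 0:
        raise ValueError("file_size must be >= 0 and num_workers must be > 0")
    base, extra = divmod(file_size, num_workers)
    ranges = []
    offset = 0
    for rank in range(num_workers):
        # The first `extra` workers take one additional byte each,
        # so the ranges cover the file exactly with no gaps or overlap.
        length = base + (1 if rank < extra else 0)
        ranges.append((offset, length))
        offset += length
    return ranges


if __name__ == "__main__":
    for rank, (offset, length) in enumerate(byte_ranges(10, 3)):
        print(f"worker {rank}: offset={offset} length={length}")
```

A split like this only helps if each worker can read starting at its own offset, which is where the storage interface starts to matter.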
I think both implications point to the fact that cloud data access and cloud data storage will have to go parallel. Leaving things in HDFS keeps the bottleneck on local data access performance. You can keep increasing local data access performance by adding more disks, adding more controllers, or switching to SSDs, but all of this adds more and more cost and, more importantly, the cost is per node! The same is true for other cloud storage systems (S3, Swift, etc.). Moreover, storage systems that rely on a REST interface are not likely to be useful in this situation because they generally do not let an application move a file pointer to a different location in a file and read or write data there in place.
What is needed is a parallel access cloud file system that is (most likely) not based on REST/SOAP. Such a file system allows applications to open a file, read and write to it, move the file pointer (lseek), and do other useful file operations. To see this in practice, examine the file systems that the largest HPC systems are using. You will see some file systems repeated over and over, such as Lustre, GPFS, and Panasas. These file systems allow IO from multiple threads to a single file (note that these threads can be running on multiple nodes).
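The file operations described above can be sketched with ordinary POSIX calls from Python's `os` module. This is only an illustration under stated assumptions: it runs on a local temporary file on a POSIX system, and threads on one machine stand in for writers on multiple nodes. On a parallel file system, each node would issue the same kind of calls against a shared mount. The names `write_chunk`, `parallel_write_then_seek`, and `CHUNK` are made up for this example.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # bytes per writer; deliberately tiny for illustration


def write_chunk(fd: int, rank: int) -> int:
    """Write one writer's chunk at its own offset in the shared file.

    os.pwrite takes an explicit offset, so concurrent writers do not
    interfere with each other through a shared file pointer.
    """
    data = bytes([ord("A") + rank]) * CHUNK
    return os.pwrite(fd, data, rank * CHUNK)


def parallel_write_then_seek(num_writers: int = 4):
    """Several threads write to a single file, then we seek, read, and overwrite."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shared.dat")
        fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            with ThreadPoolExecutor(max_workers=num_writers) as pool:
                list(pool.map(lambda rank: write_chunk(fd, rank), range(num_writers)))

            # Move the file pointer to the third chunk and read only that chunk.
            os.lseek(fd, 2 * CHUNK, os.SEEK_SET)
            third = os.read(fd, CHUNK)

            # Seek back to the start and overwrite two bytes in place.
            os.lseek(fd, 0, os.SEEK_SET)
            os.write(fd, b"zz")

            # Read the whole file back to see the combined result.
            os.lseek(fd, 0, os.SEEK_SET)
            whole = os.read(fd, num_writers * CHUNK)
        finally:
            os.close(fd)
    return third, whole


if __name__ == "__main__":
    third, whole = parallel_write_then_seek()
    print(third, whole)
```

The point of the sketch is the access pattern: independent writers target disjoint regions of one file, and a reader can jump to any offset or modify a few bytes without rewriting the whole file.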
Whatever the solution ends up being, what is clear is that cloud workloads are going to be pushing the boundaries of what can be done with a REST interface and are going to need the ability to do parallel IO. We need to rethink what we're doing in the cloud storage world and start considering parallel cloud storage.