The world is enamored with cloud computing, particularly cloud storage. Cloud storage is used for a variety of data tasks: performing IO for applications that are running in the cloud (S3 is an example), using Hadoop/MapReduce or some other analytical computation, storing data for later use but not archiving it, or storing data for archival purposes.
For all four of these use cases, the amount of data is growing very quickly — much faster than you might think. At the same time, the computational resources for processing the data are increasing.
For example, on Amazon EC2 you can get the following:
- "High-Memory Quadruple Extra Large Instance" (m2.4xlarge)
  - 68.4 GiB of memory
  - 26 EC2 Compute Units (8 virtual cores with 3.25 EC2 Compute Units each)
  - 1,690 GB of storage
- "Cluster Compute Eight Extra Large Instance" (cc2.8xlarge)
  - 60.5 GiB memory
  - 88 EC2 Compute Units (2x Intel E5-2670, 8-core each)
  - 3,370 GB storage
- "High Memory Cluster Eight Extra Large Instance" (cr1.8xlarge)
  - 244 GiB memory
  - 88 EC2 Compute Units (2x Intel E5-2670, 8-core each)
  - 240 GB SSD instance storage
- "Cluster GPU Quadruple Extra Large Instance" (cg1.4xlarge)
  - 22 GiB memory
  - 33.5 EC2 Compute Units (2x Intel X5560, 4-core each)
  - 2x NVIDIA M2050 Tesla GPU
  - 1,690 GB storage
- "High I/O Quadruple Extra Large Instance" (hi1.4xlarge)
  - 60.5 GiB memory
  - 35 EC2 Compute Units (16 virtual cores)
  - 2 SSD-based volumes with 1,024 GB each
  - Amazon states that you can achieve 120,000 random read IOPS and 10,000-85,000 random write IOPS
- "High Storage Instance" (hs1.8xlarge)
  - 117 GiB memory
  - 35 EC2 Compute Units (16 virtual cores)
  - 24 hard drives with 2TB of instance storage
  - Amazon states that each instance can deliver 2.4 GiB/s of sequential read and 2.6 GiB/s of sequential write performance
Notice that in some instances you are getting a great deal of compute power. However, you can also get quite a bit of storage, up to a little over 3.3TB. You can also opt for some instances where you can get around 2.5 GiB/s sequential performance or 120,000 random read IOPS or up to 85,000 random write IOPS.
All of these numbers are respectable, but they are for a single node. Currently, each instance has to have its own copy of the data, unless you create a storage solution using the instances (not always easy to achieve).
What happens if each instance needs to access more than 3.37TB? What if you need more storage than the SSD instances allow? How do you share data so that you don't have to copy it to each instance? What if you need more performance than is offered by these instances?
An equally important question, and one that usually goes unnoticed, is what if you need more single-node performance? If you look at the list of current Amazon instances, you will see that the fastest single node (single client) performance is about 2.4 GB/s. Sharing the file between servers only exacerbates the single-node IO throughput problem. The best way out of this problem, that I can see, is to start thinking about cloud storage in a parallel fashion.
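To put the single-node limit in perspective, here is a rough back-of-the-envelope sketch of how long one pass over a data set takes at the sequential read rate quoted above. The 3.37 TB data size and the 2.4 GiB/s figure come from the instance list; the 8-node case and the assumption of perfectly linear scaling are purely illustrative and will not hold exactly on real systems.

```python
GIB = 2**30   # bytes in one GiB
TB = 10**12   # bytes in one decimal terabyte

def read_time_seconds(data_bytes, throughput_bytes_per_s, nodes=1):
    """Time to stream data_bytes once, assuming ideal linear scaling across nodes."""
    return data_bytes / (throughput_bytes_per_s * nodes)

# One node reading 3.37 TB at 2.4 GiB/s versus a hypothetical 8 nodes in parallel.
single = read_time_seconds(3.37 * TB, 2.4 * GIB)
parallel = read_time_seconds(3.37 * TB, 2.4 * GIB, nodes=8)

print(f"1 node : {single / 60:.1f} minutes")
print(f"8 nodes: {parallel / 60:.1f} minutes")
```

Even under these generous assumptions, a single client needs on the order of 20 minutes for one sequential scan, and that time grows linearly with the size of the data.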
Cloud Data Explosion — Video Example
I'm not sure if you've ever watched some of the television programs that have "World's Worst Driver" in the title, but they are very entertaining. After I stop laughing at the person backing up the highway because they missed an exit, I realize that the video is from surveillance cameras. This realization is reinforced by the morning traffic report, where the commentator can cut to a camera that clearly shows an accident blocking traffic virtually anywhere in the city in which I live.
Another favorite of mine is that several days after a tornado, new footage pops up on the news showing the destruction. Sometimes this video comes from a parking lot and sometimes from a gas station. Regardless, there usually ends up being a remarkable amount of footage of a tornado taken from all kinds of different cameras.
People do not realize how much surveillance video is being taken continuously. Almost every store (both inside and out), parking lot, highway, doctor's office, school, casino (especially casinos), airport and fast food restaurant has cameras in place. Individuals with cell phone cameras upload video to YouTube, and neighbors take surveillance video of their yards and homes. (I have a crazy neighbor who has cameras to deter people from letting dogs into his yard, and he has actually taken people to court.) Other examples include dashcams in cars, videos inside restaurants as part of a reality show (Restaurant Stakeout or Mystery Diners) and on and on. It is important to realize that the amount of video footage being taken is enormous, even if the video is grainy.
Usually these videos get destroyed after a certain period of time because they are just taking up space. But increasingly, people are keeping the video for much longer so that it can be used as part of a data analysis.
One example is stores that analyze video to understand the shopping and buying habits of their customers. What do they see when they first walk in? Does it capture their attention? Do they have problems navigating through the store? What is their pattern in the store? Do they immediately get what they want/need and then pay for it? Or do they browse? How many times a minute do they blink? Are any of these habits a function of the time of the day? The day of the week? The month? Are they affected by weather? Are they affected by "other" events going on in the world? All of this is an effort to provide better store layouts, better signs and direction, and of course, to get you to buy more.
The algorithms to process images into information are under very serious development with the goal of understanding how things can be changed to improve sales. Consequently, stores are keeping this data much longer so they can get a history of the habits of their shoppers. In many cases this video data is too large to be kept on storage within the company, so they are increasingly resorting to the cloud. However, things are about to get worse, at least for people interested in keeping the data and keeping it in the cloud.
I think everyone understands the difference between a 720p television and a 1080p television. The 720p means that there are 720 horizontal scan lines of image display resolution (720 pixels of vertical resolution). 1080p means there are 1080 horizontal scan lines of image display resolution (1080 pixels vertically or with a 16:9 aspect ratio, 1920 x 1080 resolution). Currently, surveillance cameras generally use a much lower resolution so they don't have to store as much data. But stores, casinos, parking lots, etc. are upgrading so they can capture much higher resolution images which helps identify problems or issues. Insurance companies love this because they have more information about events such as shoplifters or accidents that they can use in court.
But wait! There's more! In some cases 4K video (approximately 4,000 pixels horizontally; with a 16:9 aspect ratio, 3840 x 2160 pixels) is being used. Unmanned Aerial Vehicles (UAVs) are already using very high resolution cameras, sometimes well above 4K. An example is DARPA's project for a 1.8 gigapixel camera. For a square image that is a resolution of about 42,426 x 42,426, or at a ratio of roughly 1.3:1 (close to 4:3) it is 48,373 x 37,210. According to that project's description, recording video of an entire city at 12 frames per second for one day produces about 6 PB of data.
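As a sanity check on the 6 PB figure, the daily volume can be estimated from the numbers quoted above. This is only a sketch: the pixel count and frame rate come from the text, but the 3 bytes per pixel (uncompressed 24-bit color) is my assumption, since the source does not say how the video is encoded or compressed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15  # bytes in one decimal petabyte

def raw_video_bytes(pixels_per_frame, fps, seconds, bytes_per_pixel=3):
    """Uncompressed video size; bytes_per_pixel=3 is an assumed 24-bit color depth."""
    return pixels_per_frame * fps * seconds * bytes_per_pixel

# 1.8 gigapixels per frame at 12 frames per second for one full day.
per_day = raw_video_bytes(1_800_000_000, 12, SECONDS_PER_DAY)

print(f"{per_day / PB:.1f} PB per day")
```

Under that assumption the estimate lands in the same range as the quoted 6 PB per day, though the agreement may be partly coincidental given the unknown encoding.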
Let's assume that we're currently taking video at 12 frames per second and a 16:9 aspect ratio. The table below lists the total pixel count for each resolution.
| Description | Resolution | Total pixels | Size relative to VHS |
|---|---|---|---|
| VHS | 480 x 320 | 153,600 | 1.0 |
| 720p | 1280 x 720 | 921,600 | 6.0 |
| 1080p | 1920 x 1080 | 2,073,600 | 13.5 |
| 4K | 3840 x 2160 | 8,294,400 | 54 |
| 8K | 7680 x 4320 | 33,177,600 | 216 |
| DARPA 1.8 gigapixel | 48,373 x 37,210 | 1,799,959,330 | 11,718 |
Just going from VHS resolution to a 720p resolution increases our data storage requirements by a factor of 6. Going to a 4K resolution directly from VHS increases the data storage requirements by a factor of 54!
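The ratios in the table are simple to reproduce. The short sketch below recomputes the total pixel count and the size relative to VHS for each resolution listed; the function and variable names are mine, chosen only for illustration.

```python
# Resolutions as listed in the table (width, height).
RESOLUTIONS = {
    "VHS": (480, 320),
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
    "8K": (7680, 4320),
    "DARPA 1.8 gigapixel": (48_373, 37_210),
}

def pixel_table(resolutions, baseline="VHS"):
    """Map each name to (total pixels, size relative to the baseline resolution)."""
    base_w, base_h = resolutions[baseline]
    base = base_w * base_h
    return {name: (w * h, w * h / base) for name, (w, h) in resolutions.items()}

table = pixel_table(RESOLUTIONS)
for name, (pixels, ratio) in table.items():
    print(f"{name:<22}{pixels:>15,d}{ratio:>12,.1f}")
```

Since the frame rate is held constant, the uncompressed storage requirement scales by the same factors as the pixel counts.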
The trend is fairly easy to understand: many more cameras + much higher resolution + longer retention time = massive increase in data. And remember, this is only surveillance video. There are more areas where data volumes are increasing like this.
Implications — Parallel Cloud Storage
There are two important implications of the increase in data. The first is that people are going to be doing much more analysis on different varieties of data. They won't be doing a single large analysis across a bunch of data. Rather, they will be running multiple analyses, asking different questions, resulting in accessing the data from different servers.
For example, one person or team could be using video data to analyze traffic patterns in the parking lot, to better understand the traffic flow either for a redesign of the lot or to reduce insurance costs. A second person or team could be using the same video data to measure how long people spend inside the store, which is then correlated with how much they spend during that time ($/minute). Both analyses are equally important, and both use the same data.
A second implication is that these data files are going to become even larger. Will the files be too big for the memory of a single node? Will the processing have to be split across nodes? At that point, performance is likely to be driven by the data access performance of a single node (a single core in some cases).
I think both implications point to the fact that cloud data access and cloud data storage will have to go parallel. Leaving things in HDFS keeps the bottleneck on local data access performance. You can keep increasing the performance of local data access by using more disks, more controllers, or switching to SSD drives. All of this points to more and more cost and, more importantly, it is per node! The same is true for other Cloud file systems (S3, Swift, etc.). Moreover, file systems that rely on a REST interface are not likely to be useful for this situation because they do not allow moving a file pointer to a different location in the file for reading or writing data.
What is needed is a parallel access cloud file system that is (most likely) not based on REST/SOAP. These file systems allow applications to open a file, read and write to it, move the file pointer (lseek) and do other useful file operations. To prove out this point, examine the file systems the largest HPC systems are using. You will see some file systems repeated over and over, such as Lustre, GPFS and Panasas. These file systems allow IO from multiple threads to a single file (Note - these threads can be running on multiple nodes).
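To make the access pattern concrete, here is a minimal sketch of several threads reading different byte ranges of a single file by seeking to their own offsets. It runs against an ordinary local temporary file as a stand-in; it is not tied to Lustre, GPFS or Panasas, and the function names and the choice of four workers are assumptions for illustration only.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length):
    """Open the file independently, seek to offset, and read up to length bytes."""
    with open(path, "rb") as f:
        f.seek(offset)  # the lseek-style file pointer move discussed above
        return f.read(length)

def parallel_read(path, workers=4):
    """Split the file into contiguous byte ranges and read them concurrently."""
    size = os.path.getsize(path)
    chunk = max(1, -(-size // workers))  # ceiling division, at least 1 byte
    offsets = range(0, size, chunk)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the parts reassemble correctly.
        parts = pool.map(lambda off: read_range(path, off, chunk), offsets)
        return b"".join(parts)

# Demo on a throwaway local file filled with random bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    payload = os.urandom(1_000_003)
    tmp.write(payload)
    demo_path = tmp.name

result = parallel_read(demo_path, workers=4)
os.remove(demo_path)
print(result == payload)
```

On a parallel file system the same pattern can be spread across processes on multiple nodes, each seeking to its own region of the shared file, which is the capability the argument above depends on.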
Whatever the solution ends up being, what is clear is that cloud workloads are going to be pushing the boundaries of what can be done with a REST interface and are going to need the ability to do parallel IO. We need to rethink what we're doing in the cloud storage world and start considering parallel cloud storage.