Data Storage, AI, and IO Patterns

Written By

Jeffrey Layton

May 13, 2018

Also see: Artificial Intelligence and Data Storage

AI is one of the hot new topics in computing, and with good reason. New techniques in Deep Learning (DL) involving Neural Networks (NNs) can create networks that achieve better-than-human accuracy on some problems. Image recognition is one example: DL models can achieve better-than-human accuracy in identifying objects in images (object detection and classification).

An example of this is the ImageNet competition. Since 2010, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been used as a gauge of the improvement in image recognition. In 2011, the best error rate was around 25% (the tool could correctly identify 75% of the images outside of the training data set). In 2012, a Deep Neural Network (DNN) had an error rate of 16%. Over the next few years the error rate dropped to the single digits. In 2017, 29 of the 36 competing teams got less than 5% wrong, which is typically better than a human.

Deep Learning uses various types of Neural Networks and can be applied to a large range of problems. There are typically two major steps in creating a model (another name for a Neural Network). The first step is called training. This is the process of having the model repeatedly read the input data set and adjust its parameters to minimize the error (the difference between the correct output and the computed output). This step requires a massive amount of input (IO) relative to what you might be used to. It also requires an extreme amount of computation.

The second step occurs after the model is trained and is referred to as inference. This is the deployment of the trained model in production. Production means that the model is put to use reading data that was not used in training. It produces output that is used for some task rather than for training a neural network. This step also has a computational component. Rather than massive amounts of computation, it needs to achieve goals such as minimizing latency, achieving the best possible accuracy, maximizing throughput, and maximizing energy efficiency.

The computations for both steps are performed by a framework. Frameworks are software tools and libraries that read a script, typically written in Python, that tells the framework what operations are needed and what the neural network looks like. The framework reads the code and then executes it. Examples of frameworks are TensorFlow, Caffe, and PyTorch.

Issues around IO patterns

Examining how Deep Learning (DL) frameworks function gives an understanding of their IO patterns. You don’t need to know details about the specific framework, and you don’t need to understand the math behind neural networks either.

The basic flow of the training step in a DL framework is fairly straightforward. Neural networks require quite a bit of diverse input data to be properly trained to perform a task. The data can be in the form of images, videos, volumes, plain numbers, or a combination of almost any data.

Did I mention that you need a great deal of data? In addition, the data has to be very diverse, with a wide range of information for each input. For example, simple facial recognition that determines if the person is a man or a woman requires well over 100 million images.

The input data can be stored in a variety of ways, ranging from simple CSV files for the really small amounts of input data used when learning about DNNs, to databases that contain the data, including images. The data can also be spread across different formats and tools as long as the DNN has access to the data and understands the input format. It can also be a combination of structured and unstructured data, as long as you, the user, know the data and the formats and can express those in the model.

The size of the data on storage media can vary quite a bit. At one extreme are simple images from the MNIST data set, which are 28 x 28 grayscale images (values ranging from 0 to 255). That’s a total of 784 pixels, which is really small. Today we have televisions and cameras that have 4K color resolution. That is 3,840 x 2,160 pixels for consumer televisions (4,096 x 2,160 for cinema cameras), for a total of roughly 8.3 to 8.8 million pixels.

The 4K color representation typically starts with 8 bits (256 choices) per color channel and can be as large as 16 bits of information (https://dgit.com/4k-hdr-guide-45905/). This can result in very large images. If you store one such image as a single uncompressed TIFF file with a resolution of 4520 x 2540 and 8 bits per channel, the size is 45.9 MB, a figure that works out to four bytes per pixel. For 16-bit colors, the size is 91.8 MB.
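The file sizes quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch, not the author's calculation: the helper name `uncompressed_image_bytes` is hypothetical, and the choice of four channels per pixel (for example RGBA) is an assumption made only because it matches the 45.9 MB and 91.8 MB figures. File headers and compression are ignored.

```python
def uncompressed_image_bytes(width, height, bits_per_channel, channels):
    """Raw pixel payload in bytes, ignoring file headers and compression."""
    return width * height * channels * bits_per_channel // 8

MB = 10**6  # decimal megabytes, as storage sizes are usually quoted

# ASSUMPTION: 4 channels per pixel; chosen because it reproduces the article's figures.
size_8bit = uncompressed_image_bytes(4520, 2540, bits_per_channel=8, channels=4)
size_16bit = uncompressed_image_bytes(4520, 2540, bits_per_channel=16, channels=4)

# An MNIST image is a single 8-bit grayscale channel.
mnist_bytes = uncompressed_image_bytes(28, 28, bits_per_channel=8, channels=1)

print(f"8-bit image:  {size_8bit / MB:.1f} MB")   # 45.9 MB
print(f"16-bit image: {size_16bit / MB:.1f} MB")  # 91.8 MB
print(f"MNIST image:  {mnist_bytes} bytes")       # 784 bytes
```

With three channels (plain RGB) the 8-bit image would come to about 34.4 MB instead, so the channel count matters when sizing a data set.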

If you had 100 million images, a reasonable amount for some facial recognition algorithms, you could have 100 million files. This isn’t too bad for today’s file systems (see this article or this one). The total space used in the case of 8-bit images is 4.59 PB (100 million files at 45.9 MB each). That’s quite a bit of space for a single NN using large high-resolution images.

In general, training a neural network has two phases. The first phase is called feed-forward. It takes the input and processes it through the network. The output is compared to the correct output to create an error. In the second phase, this error is propagated back through the network (back propagation) to adjust the parameters of the network and, hopefully, reduce the error the network produces.

This process continues until all images have been processed through the network. One such pass over the data is called an epoch. Training a network to a desired level of performance may take hundreds, thousands, or tens of thousands of epochs. Deep Learning frameworks such as TensorFlow, Caffe, or PyTorch take care of the whole process for a network model that you create.
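To make the feed-forward, back propagation, and epoch terms concrete, here is a deliberately tiny sketch in plain Python. It is not how TensorFlow, Caffe, or PyTorch are implemented: the "network" is a single weight, and the data, learning rate, and epoch count are made-up values chosen for illustration. The point to notice for IO is that every sample is read again in every epoch.

```python
# Toy data set (hypothetical): the correct output is always 3 * x.
samples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]

weight = 0.0          # the single parameter of this toy "network"
learning_rate = 0.01  # illustrative value
epoch_errors = []

for epoch in range(200):                 # one pass over all samples = one epoch
    total_error = 0.0
    for x, target in samples:            # every sample is read once per epoch
        output = weight * x              # feed-forward
        error = output - target          # compare with the correct output
        gradient = 2.0 * error * x       # derivative of error**2 w.r.t. weight
        weight -= learning_rate * gradient   # back propagation step: adjust the parameter
        total_error += error ** 2
    epoch_errors.append(total_error)

print(f"learned weight: {weight:.4f}")
print(f"squared error in first epoch: {epoch_errors[0]:.4f}")
print(f"squared error in last epoch:  {epoch_errors[-1]:.4g}")
```

A real network has millions of parameters and far more data, but the loop structure, and therefore the repeated reading of the input, is the same.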

Overall IO process

A quick summary of the IO pattern for Deep Learning is that the data is read again and again. DL is extremely read heavy. There is some writing, but it is small compared to the reading because it is mostly checkpointing during the NN training. However, some options used to improve NN training also affect the IO patterns.

As an example of the amount of data read or written, let’s assume the network requires 100 million images where each image is 45.9 MB. Moreover, let’s assume that the network model requires about 40 MB to save, that we save it every 100 epochs, and that we need 5,000 epochs to train the model.

As mentioned earlier, one epoch would require reading 4.59 PB of data. This needs to be repeated 5,000 times, for a total of 22.95 exabytes that need to be read. It also requires reading 500 billion files if each image is a single file.

For write IO, the model needs to be written 50 times. That’s a total of 2 GB across 50 writes. In comparison to the reads, this is vastly smaller.

Put another way, each 100-epoch interval in this example performs 10 billion read IOs for a total of 459 PB, followed by a single 40 MB write. This cycle repeats 50 times to make up the overall IO pattern.
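The read and write totals in this example follow directly from the stated assumptions, and a short script makes it easy to try other values. This is only a back-of-the-envelope sketch: the variable names are illustrative, decimal units are used (1 PB = 10^15 bytes), and it assumes one read per image file per epoch with no caching.

```python
MB, GB, PB, EB = 10**6, 10**9, 10**15, 10**18  # decimal units

# Assumptions taken from the example in the text.
num_images = 100_000_000
image_bytes = 45_900_000        # 45.9 MB per image
num_epochs = 5_000
checkpoint_interval = 100       # epochs between model saves
checkpoint_bytes = 40 * MB      # size of the saved model

# Reads: every image is read once per epoch.
bytes_per_epoch = num_images * image_bytes
total_read_bytes = bytes_per_epoch * num_epochs
total_file_reads = num_images * num_epochs

# Writes: one checkpoint per interval.
num_checkpoints = num_epochs // checkpoint_interval
total_write_bytes = num_checkpoints * checkpoint_bytes

# One read/write cycle: the reads between two consecutive checkpoints.
reads_per_cycle = num_images * checkpoint_interval
read_bytes_per_cycle = bytes_per_epoch * checkpoint_interval

print(f"read per epoch:   {bytes_per_epoch / PB:.2f} PB")
print(f"total read:       {total_read_bytes / EB:.2f} EB in {total_file_reads:,} file reads")
print(f"total written:    {total_write_bytes / GB:.0f} GB in {num_checkpoints} writes")
print(f"per cycle:        {reads_per_cycle:,} reads, {read_bytes_per_cycle / PB:.0f} PB")
print(f"read/write ratio: {total_read_bytes / total_write_bytes:.3g} to 1 by bytes")
```

Changing `image_bytes` or `num_epochs` shows how quickly the read volume grows while the write volume stays negligible.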

This is the basic IO pattern for DNNs in applications like facial recognition. To reduce the training time, several techniques can be used. The next section is a quick overview of these techniques from the perspective of IO.

Training techniques

One of the first techniques used in NN training is random shuffling of the input data. It is used virtually all the time to reduce the number of epochs required and to prevent overfitting (the model is optimized to the data set but performs poorly on real-world data).

Before a new epoch is started, the order in which the data is read is randomly shuffled. This means that the read IO pattern is random at the granularity of an image. It’s sequential while reading an individual image, but from one image to the next, it’s random. Because of this randomness, it’s difficult to characterize the pattern as “re-read” rather than simply “read”.
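The effect of per-epoch shuffling on the read order can be sketched with the standard library alone. This is an illustration under stated assumptions rather than any framework's data loader: the function name `epoch_read_orders`, the file names, and the seed are all hypothetical.

```python
import random

def epoch_read_orders(file_names, num_epochs, seed=0):
    """Yield the order in which files would be read in each epoch.

    The same files are read every epoch, but in a freshly shuffled order.
    """
    rng = random.Random(seed)   # seeded only so the sketch is repeatable
    order = list(file_names)
    for _ in range(num_epochs):
        rng.shuffle(order)      # reshuffle before each new epoch
        yield list(order)

# Hypothetical data set of eight image files.
files = [f"img_{i:03d}.tiff" for i in range(8)]
orders = list(epoch_read_orders(files, num_epochs=3))

for epoch, order in enumerate(orders):
    print(f"epoch {epoch}: {order}")
```

Each epoch touches exactly the same set of files, so the total bytes read do not change; only the order in which the storage system sees the requests does.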

There are also frameworks that can read data from a database. The IO pattern is still very read heavy and there can be random shuffling of the data. This can complicate understanding the IO pattern in detail because a database sits between the storage and the framework.

Sometimes frameworks will also use the mmap() function for IO. It is a system call that maps files or devices into memory. When an area of virtual memory is mapped to a file, it is said to be a “file-based mapping”. Reading certain areas of that memory will cause the file to be read. This is the default behavior.

Regardless of whether mmap() is used or not, the IO pattern is still very read heavy, following the pattern previously discussed. However, using mmap() complicates any analysis because the IO happens directly from the file(s) to memory.
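A file-based mapping can be demonstrated with Python's standard `mmap` module, which wraps the system call. This is a minimal sketch and not taken from any particular framework: the temporary file and its contents are stand-ins for a real training file.

```python
import mmap
import os
import tempfile

# Create a small stand-in data file (4096 bytes); real training files are far larger.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)) * 16)
    path = tmp.name

with open(path, "rb") as f:
    # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        mapped_size = len(mapped)
        # Slicing looks like a memory access. The operating system reads the
        # underlying file pages on demand; there is no explicit read() call here.
        chunk = mapped[1024:1032]

os.remove(path)  # clean up the temporary file

print(f"mapped {mapped_size} bytes; bytes 1024..1031 = {list(chunk)}")
```

Because the reads are triggered by memory accesses rather than explicit read calls, tools that trace read system calls will not see this traffic, which is one reason mmap() makes IO analysis harder.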

Another common technique for improving training performance is called batching. Rather than updating the network after each input image, which includes forward and back propagation, the network is updated after a “batch” of images is input. The back propagation portion of the network performs an operation on the errors, such as averaging them, to update the network parameters. This doesn’t change the IO pattern in general, since the images still have to be read, but it can impact the convergence rate. In general it can slow the convergence, but the back propagation happens less frequently, improving the computational rate.
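The idea of updating once per batch, using an average over the batch, can be shown on a toy model with a single weight. This is a sketch under made-up assumptions (the data, learning rate, batch size, and the helper name `batch_update` are all illustrative) and is not how any specific framework implements batching.

```python
def batch_update(weight, batch, learning_rate):
    """Return the weight after ONE update computed from a whole batch."""
    gradients = []
    for x, target in batch:
        error = weight * x - target        # feed-forward and compare
        gradients.append(2.0 * error * x)  # per-sample gradient of error**2
    mean_gradient = sum(gradients) / len(gradients)  # average over the batch
    return weight - learning_rate * mean_gradient

# Toy data set (hypothetical): the correct output is always 3 * x.
samples = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
batch_size = 2
weight = 0.0
updates = 0

for epoch in range(500):
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]  # all samples are still read
        weight = batch_update(weight, batch, learning_rate=0.01)
        updates += 1

print(f"learned weight: {weight:.4f} after {updates} updates")
```

With a batch size of 2, each epoch performs two parameter updates instead of four, while the number of samples read per epoch is unchanged.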

Using batching also helps improve performance when training on GPUs (Graphics Processing Units). Rather than moving one file at a time from the CPU to the GPU, batching allows you to copy several files to the GPU at once. This improves the throughput from the CPU to the GPU and reduces data transfer time. In the example above, a batch size of 32 would reduce the number of data transfers from 100 million to 3,125,000 per epoch.
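The transfer count for a given batch size is a simple ceiling division. The sketch below uses hypothetical helper names (`transfers_per_epoch`, `batches`) and does not perform any actual GPU copies; it only counts and groups items the way a batched loader would.

```python
def transfers_per_epoch(num_images, batch_size):
    """Number of CPU-to-GPU copies per epoch if each batch is one transfer."""
    return -(-num_images // batch_size)  # ceiling division with integers

def batches(items, batch_size):
    """Split a list into consecutive batches; the last one may be smaller."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

print(transfers_per_epoch(100_000_000, 1))    # 100000000 (one image per transfer)
print(transfers_per_epoch(100_000_000, 32))   # 3125000
print(list(batches(list(range(10)), 4)))      # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The number of files read from storage is the same in both cases; only the number of host-to-device transfers drops.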

Batching helps training throughput but doesn’t really impact the IO pattern. The pattern is still random reads with very little writing. It can, however, change the amount of output the framework creates.

Data Storage and Deep Learning

AI, particularly Deep Learning, is one of the computing techniques that is transforming many aspects of our lives. The algorithms for DL require a great deal of data. The amount really depends upon the algorithm and the goal of the resulting network model, but for somewhat complicated models, it can run into the hundreds of millions of sets of input. Classically, the more data you use for training the model, and the more diverse the data, the better the final trained model will be. This points to very large data sets.

There have been several articles over the last 10-20 years discussing data becoming colder. This means that after the data has been created, it is rarely touched again. One such article examined both engineering and enterprise data and found some very interesting trends:

Both of the workloads are more write-oriented. Read to write byte ratios have significantly decreased (from 4:1 to 2:1)
Read-write access patterns have increased 30-fold relative to read-only and write-only access patterns.
Files are rarely re-opened. Over 66% are re-opened once and 95% fewer than five times.
Over 90 percent of the active storage was untouched during the study.
A small fraction of clients account for a large fraction of file activity. Fewer than 1% of clients account for 50% of file requests.

It is fairly easy to summarize the overall use of data:

IO patterns are very write oriented
Data is rarely reused but is kept around

Comparing the IO patterns of DL algorithms with these trends, you can see that they are almost 100% the opposite of classic engineering, HPC, and enterprise applications. DL is very, very heavily oriented toward read IO. The data is reused while the model is being designed and trained. Even after the model is trained, there will be a need to augment the existing training data set with new data, particularly cases where the model output was wrong. This is done to improve the model as time progresses.