Artificial Intelligence and Data Storage
"My God. It's Full of Data" - Bowman (My apologies to 2001: A Space Odyssey)
Just in case you weren't sure, there is a huge revolution happening. The revolution is around using data. Rather than developers writing explicit code to perform some computation, machine learning applications, including supervised learning, reinforcement learning, and statistical classification applications, can use the data to create models. Within these categories there are a number of approaches, including deep learning, artificial neural networks, support vector machines, cluster analysis, Bayesian networks, and learning classifier systems. These tools create a higher level of abstraction of the data, which, in effect, is learning, as defined by Tom Mitchell (taken from Wikipedia):
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
After learning, these tools can make predictions based on new input data. Rather than create code with sets of rules and conditions to model a problem or a situation, these algorithms utilize only the data to form their own rules and models.
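The contrast between hand-written rules and learned models can be shown in a few lines of Python. Everything in this sketch is hypothetical: a single made-up feature (the gap between consecutive IO offsets) and two invented classes. The point is that no one writes the classification rule; a toy learner derives the cutoff entirely from labeled samples.

```python
# Minimal sketch of "the data forms the rule" versus hand-written code.
# Hypothetical feature and labels: 0 = sequential-ish IO, 1 = random-ish IO.

def learn_threshold(samples):
    """Find the cutoff that best separates the two classes.

    samples: list of (feature_value, label) pairs, label in {0, 1}.
    Returns t such that predicting 1 when value >= t misclassifies
    the fewest training samples.
    """
    best_t, best_errors = None, len(samples) + 1
    for t in sorted(v for v, _ in samples):
        errors = sum((v >= t) != bool(label) for v, label in samples)
        if errors < best_errors:
            best_t, best_errors = t, errors
    return best_t

def predict(t, value):
    return 1 if value >= t else 0

# Toy training data: (offset gap, label). The threshold below is not
# written by hand; it comes entirely from the data.
train = [(1, 0), (2, 0), (3, 0), (40, 1), (55, 1), (70, 1)]
t = learn_threshold(train)
print(t, predict(t, 5), predict(t, 100))
```

Swap in a different data set and the learned rule changes with it, which is exactly the shift away from explicitly coded conditions that the article describes.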
There are other algorithms within data analysis that focus on the data with the goal of discovering patterns or models that can explain or inform, as well as predict. There are also areas, such as descriptive statistics, that can create useful information about data.
There have been lots of articles written about data analysis, machine learning, deep learning, exploratory data analysis, confirmatory data analysis, big data, and other related areas. Generally, the perception is that these tools are used for problems that involve commerce or games, with some early scientific applications. However, the truth is that they can be applied to virtually any problem that has data associated with it. Utilizing this data, we can create models and patterns for the purpose of learning more about the overall problem.
In this article, I want to discuss a few ideas for using these techniques in the realm of storage.
Learning IO Patterns
Understanding or quantifying the IO pattern of applications has long been a tedious and often extremely difficult task. However, with an understanding of the IO pattern, you can tailor the storage solution to the application. This could mean improved performance to match the needs of the application, or a more cost-effective storage solution. You can also figure out the IO pattern and then modify the code to improve the IO pattern of the application (however you want to define "improve").
When I'm trying to understand the IO pattern of an application or a workflow, one technique that I use is to capture the strace of the application, focusing on IO functions. For one recent application I examined, the strace output had more than 9,000,000 lines, of which a bit more than 8,000,000 lines were involved in IO. Trying to extract an IO pattern from more than 8,000,000 lines is a bit difficult. Moreover, if the application is run with a different data set or a different set of options, new IO patterns may emerge. Data analysis tools, particularly machine learning, could perhaps find IO patterns and correlations that we can't find. They might also be able to predict what the IO pattern will be based on the input data set and the application runtime options.
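A first step toward taming millions of strace lines is mechanical: filter the capture down to IO-related system calls and count them. The sketch below assumes strace-style output (the sample lines and the set of calls treated as "IO" are illustrative, not exhaustive).

```python
# Sketch: reduce a large strace capture to per-syscall IO counts.
# The IO_CALLS set and sample lines are illustrative assumptions.
import re
from collections import Counter

IO_CALLS = {"read", "write", "pread64", "pwrite64", "lseek",
            "open", "openat", "close", "fsync"}

# Matches an optional "[pid NNN] " prefix, then the syscall name.
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?(\w+)\(")

def summarize(lines):
    """Count occurrences of each IO-related syscall in strace output."""
    counts = Counter()
    for line in lines:
        m = SYSCALL_RE.match(line)
        if m and m.group(1) in IO_CALLS:
            counts[m.group(1)] += 1
    return counts

sample = [
    'openat(AT_FDCWD, "data.bin", O_RDONLY) = 3',
    'read(3, "...", 4096) = 4096',
    'read(3, "...", 4096) = 4096',
    'lseek(3, 8192, SEEK_SET) = 8192',
    'close(3) = 0',
    'getpid() = 1234',          # non-IO call, ignored
]
counts = summarize(sample)
print(counts)
```

Counts alone are a crude summary, but they turn an 8,000,000-line capture into something a person, or a downstream model, can actually look at.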
There are many possible tools that could be applied to finding IO patterns, but before diving deep into them, we should define what the results of the analysis are to be. Do we want to create a neural network that mimics the IO behavior of the application over a range of data sets and input options? Or do we want to concisely summarize the IO patterns with a few numbers? Or perhaps do we want to classify the IO patterns for various data sets and application options?
The answers to these and other questions help frame what type of data analysis needs to be done and what kind of input is needed.
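As an example of the second option, summarizing the IO pattern with a few numbers, a trace of IO records can be reduced to a small feature vector. The three features below (read fraction, mean request size, and fraction of sequential requests) are my own illustrative choices, not a standard characterization.

```python
# Sketch: boil a trace of (op, offset, size) records down to three numbers.
# The feature set is an assumption chosen for illustration.

def features(trace):
    """trace: list of (op, offset, size) tuples, op in {'read', 'write'}."""
    reads = [t for t in trace if t[0] == "read"]
    sizes = [size for _, _, size in trace]
    # A request is "sequential" if its offset continues the previous request.
    sequential = sum(
        1 for prev, cur in zip(trace, trace[1:])
        if cur[1] == prev[1] + prev[2]
    )
    return {
        "read_fraction": len(reads) / len(trace),
        "mean_request_size": sum(sizes) / len(sizes),
        "sequential_fraction": sequential / max(len(trace) - 1, 1),
    }

trace = [
    ("read", 0, 4096),
    ("read", 4096, 4096),       # continues the previous read
    ("read", 8192, 4096),       # continues again
    ("write", 1048576, 512),    # jump: not sequential
]
f = features(trace)
print(f)
```

Feature vectors like this are also a natural input for the third option: running a clustering algorithm over many (data set, option) combinations to classify their IO patterns.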
One important thing to note is that it is unlikely that learning or characterizing IO patterns would be done in real time, for the simple reason that the best characterization requires the complete time history of the application. However, capturing the IO time history of an application is the key to learning and characterizing the IO pattern.
Learning about or characterizing the IO patterns, while extremely important, is not enough. The characteristics of the storage itself, both hardware and software, must be determined as well. Knowing the likely IO pattern of an application would then allow you either to estimate the IO performance on a given storage solution or to choose the best storage solution to buy.
Imagine a simple scenario: We have an application with a known IO pattern for the given input data set and the options used to run the application. From the IO characterization, we also know that the IO is estimated to take 10 percent of the total time for one storage system or 15 percent with an alternative storage system. This is data analysis for storage!
At this point we have a choice. We could run the application/data set combination on the first storage system with one estimated total run time and a given cost, or we could run it on the second, slower system, increasing the run time but perhaps lowering the cost. What is really cool is that the data analysis of the IO patterns, our use of artificial intelligence, has allowed this decision to be made with meaningful estimates. No longer do we have to guess or rely on gut feeling or intuition as we do today. We use real, hard data to create models or learn about the IO pattern and create information or knowledge that is actionable.
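The trade-off can be made concrete with back-of-the-envelope arithmetic. This sketch assumes the compute (non-IO) portion of the run is fixed across systems, so if IO is predicted to be some fraction of the total, the total is compute time divided by one minus that fraction. The compute hours and hourly costs are made up for the example.

```python
# Sketch: turn predicted IO fractions into run-time and cost estimates.
# Assumes the non-IO (compute) time is identical on both systems;
# the hours and hourly costs below are invented for illustration.

def total_runtime(compute_hours, io_fraction):
    # If IO is io_fraction of the total, total = compute / (1 - io_fraction).
    return compute_hours / (1.0 - io_fraction)

compute_hours = 9.0   # time spent outside IO, assumed fixed
systems = {
    "fast_storage": {"io_fraction": 0.10, "cost_per_hour": 2.00},
    "slow_storage": {"io_fraction": 0.15, "cost_per_hour": 1.25},
}

for name, s in systems.items():
    hours = total_runtime(compute_hours, s["io_fraction"])
    print(f"{name}: {hours:.2f} h, ${hours * s['cost_per_hour']:.2f}")
```

With these made-up numbers, the slower system costs less per run despite the longer wall-clock time, exactly the kind of estimate-driven decision the article argues for.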