"My God. It's Full of Data" - Bowman (My apologies to 2001: A Space Odyssey)
Just in case you weren't sure, there is a huge revolution happening. The revolution is around using data. Rather than developers writing explicit code to perform some computation, machine learning applications, including supervised learning, reinforcement learning and statistical classification applications, can use the data to create models. Within these categories there are a number of approaches, including deep learning, artificial neural networks, support vector machines, cluster analysis, Bayesian networks and learning classifier systems. These tools create a higher level of abstraction of the data, which, in effect, is learning, as defined by Tom Mitchell (taken from Wikipedia):
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."
After learning, these tools can make predictions based on new input data. Rather than create code with sets of rules and conditions to model a problem or a situation, these algorithms utilize only the data to form their own rules and models.
There are other algorithms within data analysis that focus on the data with the goal of discovering patterns or models that can explain or inform, as well as predict. There are also areas, such as descriptive statistics, that can create useful information about data.
There have been lots of articles written about data analysis, machine learning, deep learning, exploratory data analysis, confirmatory data analysis, big data and other related areas. Generally, the perception is that these tools are used for problems that involve commerce or games with some early scientific applications. However, the truth is that they can be applied to virtually any problem that has data associated with it. Utilizing this data, we can create models and patterns for the purpose of learning more about the overall problem.
In this article, I want to discuss a few ideas for using these techniques in the realm of storage.
Learning IO Patterns
Understanding or quantifying the IO pattern of applications has long been a tedious and often extremely difficult task. However, with an understanding of the IO pattern, you can tailor the storage solution to the application. This could mean improved performance to match the needs of the application, or a more cost-effective storage solution. You can also figure out the IO pattern and then modify the code to improve the IO pattern of the application (however you want to define "improve").
When I'm trying to understand the IO pattern of an application or a workflow, one technique that I use is to capture the strace of the application, focusing on IO functions. For one recent application I examined, the strace output had more than 9,000,000 lines, of which a bit more than 8,000,000 lines were involved in IO. Trying to extract an IO pattern from more than 8,000,000 lines is a bit difficult. Moreover, if the application is run with a different data set or a different set of options, new IO patterns may emerge. Data analysis tools, particularly machine learning, could perhaps find IO patterns and correlations that we can't find. They might also be able to predict what the IO pattern will be based on the input data set and the application runtime options.
There are many possible tools that could be applied to finding IO patterns, but before diving deep into them, we should define what the results of the analysis are to be. Do we want to create a neural network that mimics the IO behavior of the application over a range of data sets and input options? Or do we want to concisely summarize the IO patterns with a few numbers? Or perhaps do we want to classify the IO patterns for various data sets and application options?
The answers to these and other questions help frame what type of data analysis needs to be done and what kind of input is needed.
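As one illustration of the "few numbers" option, here is a minimal sketch that boils strace output down to call counts, bytes moved and time spent per IO syscall. It assumes the trace was captured with strace's -T option, so each line ends with the time spent in the call in angle brackets. The function name summarize_io and the sample lines are made up for illustration, and only four syscalls are matched; a real workflow would use more.

```python
import re
from collections import defaultdict

# Matches strace lines such as:
#   read(3, "abc"..., 4096) = 4096 <0.000020>
# Only a handful of IO syscalls are covered in this sketch.
IO_CALL = re.compile(
    r'\b(?P<call>read|write|pread64|pwrite64)\((?P<fd>\d+),.*\)'
    r'\s+=\s+(?P<ret>-?\d+)(?:\s+<(?P<secs>\d+\.\d+)>)?'
)

def summarize_io(lines):
    """Return per-syscall call count, total bytes and total seconds."""
    stats = defaultdict(lambda: {"calls": 0, "bytes": 0, "seconds": 0.0})
    for line in lines:
        match = IO_CALL.search(line)
        if match is None:
            continue  # not one of the IO calls we track
        returned = int(match.group("ret"))
        if returned < 0:
            continue  # the call failed, so no data was moved
        entry = stats[match.group("call")]
        entry["calls"] += 1
        entry["bytes"] += returned
        if match.group("secs"):
            entry["seconds"] += float(match.group("secs"))
    return dict(stats)

# Hypothetical sample lines, not taken from a real application.
sample = [
    'openat(AT_FDCWD, "/data/in.dat", O_RDONLY) = 3 <0.000030>',
    'read(3, "abc"..., 4096) = 4096 <0.000020>',
    'read(3, "abc"..., 4096) = 1024 <0.000010>',
    'write(4, "result"..., 512) = 512 <0.000050>',
    'read(5, 0x7ffd, 4096) = -1 EAGAIN (Resource temporarily unavailable) <0.000005>',
]
summary = summarize_io(sample)
for call, entry in sorted(summary.items()):
    print(call, entry)
```

For a trace with millions of lines, the same function can be fed an open file object, since it only needs an iterable of lines. The resulting per-call totals are also the sort of compact input a learning tool could start from.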
One important thing to note is that it is unlikely that learning or characterizing IO patterns would be done in real time. The simple reason is that to best characterize the pattern, one would need the complete time history of the application, and that only exists once the run has finished. However, capturing the IO time history of an application is the key to learning and characterizing the IO pattern.
Learning about or characterizing the IO patterns, while extremely important, is not enough. The characteristics of the storage itself, both hardware and software, must be determined as well. Knowing the likely IO pattern of an application would then allow either the IO performance to be estimated for a given storage solution or allow a buyer to choose the best storage solution.
Imagine a simple scenario: We have an application with a known IO pattern for the given input data set and the options used to run the application. From the IO characterization, we also know that the IO is estimated to take 10 percent of the total time for one storage system or 15 percent with an alternative storage system. This is data analysis for storage!
At this point we have a choice. We could run the application/data set combination on the first storage system with one estimated total run time and a given cost. Or we could run it on the second one that is slower, causing the run time to increase, but perhaps with a lower cost. What is really cool is that the data analysis of the IO patterns, our use of artificial intelligence, has allowed this decision to be made with meaningful estimates. No longer do we have to guess or use our gut feeling or intuition as we do today. We use real, hard data to create models or learn about the IO pattern and create information or knowledge that is actionable.
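To make the trade-off concrete, here is a small sketch of the arithmetic behind the 10 percent versus 15 percent scenario. It rests on two assumptions that are mine and not part of the scenario: each percentage is the IO share of that system's own total run time, and the non-IO (compute) time is the same on both systems. The 9 compute hours are a made-up figure.

```python
def total_runtime(compute_hours, io_fraction):
    """Total run time when IO takes io_fraction of that total.

    If total = compute + io and io = io_fraction * total,
    then total = compute / (1 - io_fraction).
    """
    if not 0.0 <= io_fraction < 1.0:
        raise ValueError("io_fraction must be in [0, 1)")
    return compute_hours / (1.0 - io_fraction)

compute_hours = 9.0  # hypothetical non-IO time, identical on both systems

first = total_runtime(compute_hours, 0.10)   # IO is 10 percent of the total
second = total_runtime(compute_hours, 0.15)  # IO is 15 percent of the total
slowdown = second / first - 1.0              # relative increase in run time

print(f"first system:  {first:.2f} hours")
print(f"second system: {second:.2f} hours")
print(f"slowdown:      {slowdown:.1%}")
```

Under these assumptions the second system stretches the run by a bit under 6 percent, whatever the compute time is. That number can then be weighed against the difference in cost.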
Learning Data Lifecycles
Since the data explosion of the last few years, organizations have renewed interest in moving data that is not being actively used but still must be retained to very reliable but less performant and possibly inexpensive storage. Many times this type of storage is referred to as an online archive. You have to keep the data available (i.e. not in cold off-line storage), but you may not use it often. Regulatory data is a great example of this. Also, many times, engineering and scientific data is required to be available at all times for review and for other researchers to access.
Knowing when to move data from active, higher-performance and more expensive storage, possibly down to less-expensive, less-performing storage and even further down to online archive is referred to as data lifecycle management. Today the movement of data to different storage tiers is controlled either manually or through a simple set of rules based on the age of a file, its size or its owner.
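For reference, the rule-based approach used today looks roughly like the sketch below. Everything in it is hypothetical: the tier names, the thresholds and the idea of pinning certain owners to active storage are placeholders, not settings from any particular product.

```python
from dataclasses import dataclass

@dataclass
class FileInfo:
    path: str
    age_days: int
    size_bytes: int
    owner: str

def choose_tier(info, pinned_owners=frozenset(),
                cool_after_days=90, archive_after_days=365,
                large_file_bytes=10 * 1024**3):
    """Pick a storage tier from fixed rules on owner, age and size."""
    if info.owner in pinned_owners:
        return "active"          # some owners always stay on fast storage
    if info.age_days >= archive_after_days:
        return "online-archive"  # old files go to the archive tier
    if info.age_days >= cool_after_days or info.size_bytes >= large_file_bytes:
        return "capacity"        # aging or very large files move down one tier
    return "active"

old_file = FileInfo("/proj/a.dat", age_days=400, size_bytes=1_000, owner="janet")
new_file = FileInfo("/proj/b.dat", age_days=10, size_bytes=1_000, owner="ken")
big_file = FileInfo("/proj/c.dat", age_days=10, size_bytes=20 * 1024**3, owner="ken")

print(choose_tier(old_file))  # online-archive
print(choose_tier(new_file))  # active
print(choose_tier(big_file))  # capacity
```

The rules never look at how a file is actually used, which is the gap a learning method could fill.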
Could we use data analysis and learning methods to improve on this approach?
An alternative, and perhaps better, method for data lifecycle management is to begin by gathering information about how the data is used. How often is it accessed? Who accesses it? What kind of file is it? What type of data does it contain? As we gather this information, learning algorithms can start to create predictive models. Now we have much more information about the data and how it is used. This can be used to create a "predictor" of how the data is likely to be used in the future. This information can help planners make better decisions about the data and when and where to place it.
For example, after a user creates a file, the predictive tool could predict how often the data will be used. Based on the predictions, we could take actions such as immediately moving the file to an online archive or even to cold storage. Or the tool could predict that the data needs to be kept on active storage for at least three months before moving to colder storage.
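A very simple stand-in for such a predictive tool is sketched below. It is not a learned model: it takes a high quantile of how old past files of the same type were when they were last accessed and uses that as the predicted "active window" for a new file. The function names, file types, history records and the 30-day default are all hypothetical.

```python
import math
from collections import defaultdict

def fit_active_windows(history, quantile=0.9):
    """history: iterable of (file_type, age_in_days_at_last_access).

    Returns, per file type, the nearest-rank quantile of those ages.
    """
    ages = defaultdict(list)
    for file_type, age_days in history:
        ages[file_type].append(age_days)
    windows = {}
    for file_type, values in ages.items():
        values.sort()
        rank = max(1, math.ceil(quantile * len(values)))  # nearest-rank method
        windows[file_type] = values[rank - 1]
    return windows

def recommend(file_type, windows, default_days=30):
    """Turn the predicted active window into a placement suggestion."""
    days = windows.get(file_type, default_days)
    if days == 0:
        return "archive now"
    return f"keep on active storage for {days} days"

# Made-up history: simulation output is revisited for months,
# audit logs are never read again after they are written.
history = [
    ("sim_output", 20), ("sim_output", 45), ("sim_output", 90),
    ("sim_output", 60), ("sim_output", 85),
    ("audit_log", 0), ("audit_log", 0), ("audit_log", 0),
]
windows = fit_active_windows(history)

print(recommend("sim_output", windows))  # keep on active storage for 90 days
print(recommend("audit_log", windows))   # archive now
```

A real predictor would use many more features (owner, project, file contents) and a proper learning algorithm, but the output would be used the same way.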
There are many other possibilities. For example, if data is created with a certain format, but is typically also used in a different format at some point in the future, the predictive tool can tell you this, so that the file can be converted automatically (i.e. no user intervention).
Another useful action from a predictive model would be to notify the users who typically access the data that new data is available. If user janet creates some new data and users ken, john and mary typically access that data, then a simple email or SMS can notify them that new data is available and where it is located. This is a simple variation of how directed marketing is used today. Rather than notify you of things you might want to buy, you get notified when data you typically use or that you might be interested in is available.
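Working out who "typically" accesses a user's data can start with simple counting over an access log, as in this sketch. The log format, the function name typical_readers and the threshold of two distinct files are assumptions for illustration; sending the actual email or SMS is left out.

```python
from collections import Counter

def typical_readers(access_log, creator, min_files=2):
    """Return readers who opened at least min_files distinct files by creator.

    access_log: iterable of (creator, file_name, reader) records.
    """
    seen = set()
    files_read = Counter()
    for owner, file_name, reader in access_log:
        if owner != creator or reader == creator:
            continue  # other people's data, or the creator reading their own
        if (file_name, reader) in seen:
            continue  # count each file once per reader
        seen.add((file_name, reader))
        files_read[reader] += 1
    return sorted(r for r, n in files_read.items() if n >= min_files)

# Hypothetical access records.
access_log = [
    ("janet", "a.dat", "ken"), ("janet", "a.dat", "ken"), ("janet", "b.dat", "ken"),
    ("janet", "a.dat", "john"), ("janet", "b.dat", "john"),
    ("janet", "a.dat", "mary"), ("janet", "c.dat", "mary"),
    ("janet", "a.dat", "bob"),
    ("ken", "x.dat", "mary"),
]
to_notify = typical_readers(access_log, "janet")
print(to_notify)  # ['john', 'ken', 'mary']
```

Here bob opened only one of janet's files, so he falls below the threshold and is not notified.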
Holy Grail of Storage — Failure Prediction
Storage failures can have a huge impact on productivity. When there is a failure, you have to find what data was lost (if any) and then restore the data either from a backup or copy. This takes time and is very disruptive to productivity. For many years, people have tried to develop ways of predicting storage failure.
A great example of this is studying the SMART logs of drives to develop metrics that predict failure and to find thresholds that signal imminent failure, so that a drive can be removed from service before it fails. However, researchers found that a specific metric or small set of metrics could not accurately predict this failure. Perhaps this is another area where AI, particularly machine learning, can help.
There are multiple approaches to machine learning and failure analysis, but one approach is to gather data from the storage system itself. This could include all kinds of measurements and metrics including SMART data, SMART tests, information from the file system, diagnostics from any controllers, file system metadata information, application performance, etc. If the data includes storage failures, then you could take the data and label each sample as coming from either a functioning storage system or a failed or soon-to-be-failed one. In this case, it's a simple binary classifier, but it might include enough information to help you predict if a failure is imminent.
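To show the shape of such a binary classifier, here is a minimal sketch using a nearest-centroid rule, about the simplest classifier there is. I picked it for brevity, not because it is the right tool for failure prediction. The three features (reallocated sectors, pending sectors, controller errors per day) and all the numbers are invented for illustration.

```python
import math

def centroid(rows):
    """Mean feature vector of a list of equal-length rows."""
    return [sum(column) / len(rows) for column in zip(*rows)]

def train(samples):
    """samples: list of (features, label). Returns one centroid per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(rows) for label, rows in by_label.items()}

def predict(model, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))

# Made-up training data:
# [reallocated sectors, pending sectors, controller errors per day]
training = [
    ([0, 0, 1], "healthy"), ([2, 0, 0], "healthy"), ([1, 1, 2], "healthy"),
    ([40, 12, 9], "failing"), ([55, 20, 15], "failing"), ([35, 8, 12], "failing"),
]
model = train(training)

print(predict(model, [3, 1, 1]))     # healthy
print(predict(model, [38, 10, 11]))  # failing
```

With real measurements the features would need to be scaled to comparable ranges first, and a stronger classifier would likely be needed, since single metrics have proven to be poor predictors on their own.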
Other approaches could go far beyond the scope of this article. A Hidden Markov Model (HMM) could also prove to be a good tool for modeling storage failure.
Opportunities for AI and Storage
There are two trends happening that can prove to be an opportunity for storage: (1) we are collecting lots of data all the time, and (2) useful data analysis tools are available and being rapidly developed.
We could apply this combination to all sorts of storage issues to create useful information for administration and design of storage systems. As storage systems grow larger, the ability for a full-time engineer to manage, monitor and maintain a petabyte-scale storage system is extremely limited. Managers can harvest lots of data about the storage system and the applications using the storage, but creating actionable information from that data is virtually impossible without the use of data analysis and AI tools.