Using AI for Data Storage

Learning Data Lifecycles

Since the data explosion of recent years, organizations have renewed interest in moving data that is not actively used, but must still be retained, to storage that is very reliable but less performant and, ideally, less expensive. This type of storage is often referred to as an online archive: the data must remain available (i.e., not in cold, offline storage), even though it may be accessed infrequently. Regulatory data is a great example of this. Engineering and scientific data is also often required to be available at all times for review and for other researchers to access.

Knowing when to move data from active, higher-performance, more expensive storage down to less expensive, lower-performing storage, and even further down to an online archive, is referred to as data lifecycle management. Today, the movement of data between storage tiers is controlled either manually or through a simple set of rules based on the age of a file, its size, or its owner.
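To make the current approach concrete, here is a minimal sketch of rule-based tiering. The tier names, thresholds, and file attributes are all hypothetical; real systems encode similar first-match-wins policies.

```python
# Hypothetical rule set illustrating today's static, age/size-based tiering.
# Each rule is (condition, target tier); the first match wins.
RULES = [
    (lambda f: f["age_days"] > 365, "online_archive"),
    (lambda f: f["age_days"] > 90 or f["size_gb"] > 100, "capacity_tier"),
    (lambda f: True, "performance_tier"),  # default: keep on fast storage
]

def place_file(file_info):
    """Return the storage tier for a file based on simple static rules."""
    for condition, tier in RULES:
        if condition(file_info):
            return tier

print(place_file({"age_days": 400, "size_gb": 2}))   # online_archive
print(place_file({"age_days": 10, "size_gb": 500}))  # capacity_tier
```

Note that nothing in these rules reflects how the data is actually used, which is exactly the gap the question below points at.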

Could we use data analysis and learning methods to improve on this approach?

An alternative, and perhaps better, method for data lifecycle management is to begin by gathering information about how the data is used. How often is it accessed? Who accesses it? What kind of file is it? What type of data does it contain? As we gather this information, learning algorithms can start to create predictive models. Now we have much more information about the data and how it is used. This can be used to create a "predictor" of how the data is likely to be used in the future. This information can help planners make better decisions about when and where to place the data.

For example, after a user creates a file, the tool could predict how often the data will be accessed. Based on that prediction, we could take actions such as immediately moving the file to an online archive or even to cold storage. Or the tool could predict that the data needs to be kept on active storage for at least three months before moving to colder storage.
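A very simple version of such a predictor can be sketched as follows. This stands in for a real learned model: it "trains" on hypothetical historical access counts per file type and places new files accordingly. The file types, counts, and threshold are all invented for illustration.

```python
from collections import defaultdict

# Toy training data: (file_type, owner, accesses observed in first 90 days).
history = [
    ("log", "janet", 0), ("log", "ken", 1), ("log", "mary", 0),
    ("results", "janet", 40), ("results", "ken", 35),
]

# Learn the mean access count per file type -- a stand-in for a real model.
totals = defaultdict(lambda: [0, 0])  # file_type -> [sum, count]
for ftype, owner, count in history:
    totals[ftype][0] += count
    totals[ftype][1] += 1

def predict_accesses(ftype):
    """Predicted 90-day access count for a new file of this type."""
    s, n = totals.get(ftype, (0, 0))
    return s / n if n else None

def placement(ftype, threshold=5):
    """Archive immediately if predicted use is low; otherwise keep active."""
    pred = predict_accesses(ftype)
    return "online_archive" if pred is not None and pred < threshold else "active"

print(placement("log"))      # online_archive
print(placement("results"))  # active
```

A production model would use many more features (owner, project, content type, time of day) and a proper learning algorithm, but the decision flow is the same.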

There are many other possibilities. For example, if data is created in a certain format but is typically also used in a different format at some point in the future, the predictive tool can tell you this, so that the file can be converted automatically (i.e., with no user intervention).

Another useful action from a predictive model would be to notify the users who typically access the data that new data is available. If user janet creates some new data, and users ken, john, and mary typically access her data, then a simple email or SMS can notify them that new data is available and where it is located. This is a simple variation of how directed marketing is used today: rather than being notified of things you might want to buy, you are notified when data you typically use, or might be interested in, becomes available.
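The notification step itself is trivial once the access-pattern model has produced a producer-to-consumers map. A minimal sketch, with the subscription map and delivery mechanism assumed:

```python
# Hypothetical map learned from access logs: users who habitually read
# data created by each producer.
subscribers = {"janet": ["ken", "john", "mary"]}

def notify_on_create(creator, path, send=print):
    """Alert each habitual consumer that new data is available at `path`."""
    for user in subscribers.get(creator, []):
        send(f"To {user}: new data from {creator} at {path}")

notify_on_create("janet", "/data/projects/run42.h5")
```

In practice `send` would be an email or SMS gateway rather than `print`, and the map would be refreshed as access patterns change.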

Holy Grail of Storage — Failure Prediction

Storage failures can have a huge impact on productivity. When there is a failure, you have to find what data was lost (if any) and then restore the data either from a backup or copy. This takes time and is very disruptive to productivity. For many years, people have tried to develop ways of predicting storage failure.

A great example of this is the study of drive SMART logs to develop metrics that predict failure, and to find thresholds that signal imminent failure so a drive can be removed from service before it fails. However, researchers found that no single metric, or small set of metrics, could accurately predict failure. Perhaps this is another area where AI, particularly machine learning, can help.

There are multiple approaches to machine learning and failure analysis, but one approach is to gather data from the storage system itself. This could include all kinds of measurements and metrics: SMART data, SMART tests, information from the file system, diagnostics from any controllers, file system metadata, application performance, and so on. If the data includes storage failures, you could then label each observation as coming from a functioning storage system or from a failed (or soon-to-fail) one. In this case it's a simple binary classifier, but it might include enough information to help you predict whether a failure is imminent.
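As a toy illustration of such a binary classifier, here is a nearest-centroid sketch over hypothetical SMART-style features. The feature choices, values, and labels are invented; a real system would train a proper model (logistic regression, random forest, etc.) on large fleets of labeled drive telemetry.

```python
# Hypothetical labeled samples: (reallocated_sectors, read_error_rate, temp_C).
healthy = [(0, 1, 35), (2, 0, 38), (1, 2, 40)]
failing = [(120, 55, 52), (200, 80, 58), (90, 40, 55)]

def centroid(rows):
    """Mean of each feature column."""
    n = len(rows)
    return tuple(sum(col) / n for col in zip(*rows))

C_HEALTHY, C_FAILING = centroid(healthy), centroid(failing)

def dist2(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(sample):
    """Classify a drive as 'ok' or 'at_risk' by the nearest class centroid."""
    return "ok" if dist2(sample, C_HEALTHY) < dist2(sample, C_FAILING) else "at_risk"

print(predict((3, 2, 41)))     # ok
print(predict((150, 60, 50)))  # at_risk
```

Even this crude classifier captures the key idea: no single threshold is used, so the decision reflects the joint behavior of several metrics at once.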

Other approaches could go far beyond the scope of this article. A Hidden Markov Model (HMM) could also prove to be a good tool for modeling storage failure.
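To hint at what an HMM approach looks like, here is the forward algorithm for a two-state model (hidden states "healthy" and "degraded") with observations being daily error counts bucketed as "low" or "high". All probabilities are illustrative, made-up numbers, not measured values.

```python
states = ("healthy", "degraded")
start = {"healthy": 0.95, "degraded": 0.05}
trans = {"healthy": {"healthy": 0.98, "degraded": 0.02},
         "degraded": {"healthy": 0.05, "degraded": 0.95}}
emit = {"healthy": {"low": 0.9, "high": 0.1},
        "degraded": {"low": 0.3, "high": 0.7}}

def degraded_probability(observations):
    """P(current hidden state is 'degraded' | observation sequence),
    computed with the standard HMM forward algorithm."""
    # Initialize forward probabilities with the first observation.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    # Recurse over the remaining observations.
    for obs in observations[1:]:
        alpha = {s: emit[s][obs] * sum(alpha[p] * trans[p][s] for p in states)
                 for s in states}
    return alpha["degraded"] / sum(alpha.values())

print(degraded_probability(["low", "low", "high", "high"]))
```

A run of "high" error days quickly drives the degraded probability up, which could trigger proactive data migration off the suspect device before an outright failure.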

Opportunities for AI and Storage

There are two trends happening that can prove to be an opportunity for storage: (1) we are collecting lots of data all the time, and (2) useful data analysis tools are available and being rapidly developed.

We could apply this combination to all sorts of storage issues to create useful information for the administration and design of storage systems. As storage systems grow larger, the ability of even a full-time engineer to manage, monitor, and maintain a petabyte-scale storage system is extremely limited. Administrators can harvest lots of data about the storage system and the applications using it, but creating actionable information from that data is virtually impossible without data analysis and AI tools.

