This means that pre-reading data into cache depends on a block address being hit at least once, and often multiple times, before the block is moved into flash cache. This is true in a POSIX file system world, where an attached storage controller has to figure out where a file is and move it. Things get a bit easier with object storage and with storage solutions that know the topology of the data, like the Seagate Kinetic disk drives.
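To make the hit-count behavior concrete, here is a minimal sketch of a controller that only promotes a block after it has been read a set number of times. The class name, the threshold of two reads and the block address are illustrative assumptions, not any vendor's actual policy, and eviction is left out entirely.

```python
from collections import Counter


class HitCountPromoter:
    """Hypothetical controller: promote a block to flash after `threshold` reads."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.hits = Counter()   # block address -> reads served from disk so far
        self.flash = set()      # block addresses currently held in flash cache

    def read(self, block_addr):
        """Record one read and return the tier that served it."""
        if block_addr in self.flash:
            return "flash"
        self.hits[block_addr] += 1
        if self.hits[block_addr] >= self.threshold:
            # Promotion only helps reads that come after this one.
            self.flash.add(block_addr)
        return "disk"


promoter = HitCountPromoter(threshold=2)
tiers = [promoter.read(4096) for _ in range(3)]
print(tiers)
```

The point of the sketch is that the first reads are always served from the slow tier: the controller has nothing to go on except the hits it has already seen.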
What Is the Right Tier?
Finding the right tier is really hard without a priori knowledge. Take the following example with streaming video.
If the data has not been used in a long time it will likely be on the lowest tier of storage in terms of cost and performance. Let’s say you have a 500 MB file that you want to play. If you are only going to read the file one time, why should it be moved from low performance storage to higher performance storage?
On the other hand, what if the application that opened the file was video editing software instead of video playback software? For an editor, moving the whole file into high-performance storage would make great sense. Therein lies the problem.
Depending on what applications are being used, the usage of data could be significantly different. And yet you do not know that at the block level or even at the object layer. Figuring out the right tier to move things to might be easy for something like database index tables that are constantly hit, or file system metadata when people are doing ls -l. But for data that is used irregularly, how can anyone know whether it makes sense to move it? The overhead of moving a file from slow disk to fast flash cache is going to be very costly if the file is only read one time, or even if the policy is to move it once it has been read two times.
Moving data around, especially large files that might or might not be read and written in patterns similar to databases or file system metadata, is costly in terms of the hardware interconnect design needed for high-speed bandwidth between the tiers. It is also costly because you will use expensive flash cache for data and data accesses that do not really benefit from flash.
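A rough back-of-the-envelope sketch shows why a read-once or read-twice policy can lose. The bandwidth figures (100 MB/s for the slow tier, 1,000 MB/s for flash) are assumptions chosen for easy arithmetic, and the model assumes the copy to flash is a separate pass limited by the slow tier. It counts only time, not the flash capacity that gets used up.

```python
def total_seconds(reads, size_mb, slow_mb_s, fast_mb_s, promote_after=None):
    """Total time to serve `reads` full reads of one file.

    promote_after=None means the file never leaves the slow tier.
    Otherwise the file is copied to flash after that many slow reads.
    """
    slow_read = size_mb / slow_mb_s
    fast_read = size_mb / fast_mb_s
    if promote_after is None or reads < promote_after:
        return reads * slow_read
    copy = slow_read  # assumption: one extra pass over the slow tier
    return promote_after * slow_read + copy + (reads - promote_after) * fast_read


# The 500 MB video file, with assumed tier bandwidths.
for reads in (2, 10):
    stay = total_seconds(reads, 500, 100, 1000)
    move = total_seconds(reads, 500, 100, 1000, promote_after=2)
    print(reads, "reads: stay on disk", stay, "s, promote after 2 reads", move, "s")
```

Under these assumptions, a file read exactly twice costs 15 seconds with promotion versus 10 seconds without it, while a file read ten times drops from 50 seconds to 19. The policy only pays off when the reuse actually shows up.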
There are a number of tradeoffs vendors are making. Disk drive vendors are working on understanding access patterns at the drive level for various application workloads, to be used in their hybrid drives. Specific classes of applications, such as email, web servers and the like, might have access patterns that can be coded into firmware to improve (or likely improve) performance.
Having flash cache on the disk drive allows data that is reused often to be moved into cache without having to move it off the drive. The challenge in this architecture is that often in tiered storage, besides having slower devices, you have less bandwidth going to each tier. With the storage controller approach, you move the data from the slower storage to storage with higher performance. Unless the data is reused enough times, or belongs to an application of critical importance, you have just moved data to high-cost storage for no reason.
Future Storage Developments
I heard a saying that I really like and use: “Lead, follow or get out of the way.”
The POSIX read/write/open/close interface did not lead, is now following and, over the next decade or so, is likely going to get out of the way for the most part. We might not like this, but it is going to happen. This has an important impact on storage caching, as the biggest problem with caching today is that the topology of a file is not known, and cannot be known, because it cannot be passed easily through all the layers to where the data resides. Object interfaces will allow data topology to be understood and passed down the stack, and maybe even application usage information could be passed along to describe usage patterns.
Let’s assume that this happens. This will allow every part of the stack to make better decisions. Disk drives will know what can be cached and what should be cached: whether a file is opened read-only or read/write, whether the application is streaming writes or doing IOPS, or whatever. I think that as we move forward, the richness of data that can be passed with object interfaces is going to become available, and vendors up and down the stack will be using it. I do not think this will take long, maybe 5 years.
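As a sketch of what that could look like, imagine the application's intent traveling with the object so the storage layer can choose a tier up front instead of guessing from hits. Everything here (the `Access` values, the `ObjectHint` fields and the `choose_tier` rule) is hypothetical and does not correspond to any real object storage API.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    """Hypothetical usage hints an application could attach to an object."""
    STREAM_READ_ONCE = "stream_read_once"   # e.g. video playback
    READ_WRITE_EDIT = "read_write_edit"     # e.g. video editing
    RANDOM_IOPS = "random_iops"             # e.g. database index tables


@dataclass
class ObjectHint:
    object_id: str
    size_mb: int
    access: Access


def choose_tier(hint):
    """Pick a tier from the declared usage rather than from observed hits."""
    if hint.access is Access.STREAM_READ_ONCE:
        return "disk"    # a single sequential pass does not repay a move to flash
    return "flash"       # repeated or random access benefits from the fast tier


playback = ObjectHint("video-001", 500, Access.STREAM_READ_ONCE)
editing = ObjectHint("video-001", 500, Access.READ_WRITE_EDIT)
print(choose_tier(playback), choose_tier(editing))
```

The same 500 MB object lands on different tiers depending on which application opened it, which is exactly the information a block-level cache never sees.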
This will put another nail in the POSIX coffin, which should not be a surprise to anyone. The POSIX file system people at The Open Group had the chance to lead, but they chose to follow and now will be made obsolete. This kind of thing happens often in our industry, and people never seem to learn from the mistakes of the past.