When I started in this business more than 30 years ago, it took a supercomputer to do what a laptop can do today, and networks were in their infancy in places like Stanford. Storage is a lot more complicated these days, and storage architects and administrators need to be on top of a whole lot more than they used to. So with a nod to the now-retired David Letterman, here is my list of the Top 10 things storage architects and admins need to be monitoring and doing.
#10 Looking for Soft Errors
Hard and soft errors on the host, in the network and on devices are going to slow down your system, and in the long run, soft errors usually turn into hard errors. Storage architects need to ensure that they have a management framework that traps these errors and provides alerts to the administration staff, and the administration staff needs to aggressively go after these errors. A case can be made that multiple soft errors could increase the potential for silent data corruption.
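As a hedged illustration of what such trapping might look like, the sketch below polls SMART counters with smartctl (from the smartmontools package) and flags a few attributes commonly associated with soft errors. The attribute names, device list and send_alert() hook are assumptions to be replaced by your own management framework.

```python
#!/usr/bin/env python3
"""Minimal sketch: poll SMART counters and flag likely soft errors.

Assumes smartmontools is installed and that the attribute names and device
paths below match your hardware; send_alert() is a placeholder for your own
management framework.
"""
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]             # hypothetical device list
WATCHED = {"Reallocated_Sector_Ct",            # attribute names vary by vendor
           "Current_Pending_Sector",
           "UDMA_CRC_Error_Count"}

def send_alert(device, attribute, raw_value):
    # Placeholder: hand off to email, SNMP traps, a ticketing system, etc.
    print(f"ALERT {device}: {attribute} raw value is {raw_value}")

def check_device(device):
    # 'smartctl -A' prints one line per SMART attribute; field 10 is the raw value.
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED:
            digits = "".join(ch for ch in fields[9] if ch.isdigit())
            raw = int(digits) if digits else 0
            if raw > 0:
                send_alert(device, fields[1], raw)

if __name__ == "__main__":
    for dev in DEVICES:
        check_device(dev)
```

A real framework would also track these counters over time, since a value that is growing matters more than a value that is merely nonzero.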
#9 Performance Analysis
The days of throwing hardware at performance problems likely ended in 2008; today, storage administrators and architects need to monitor performance continually, given the complexity of the storage hierarchy. There are so many caches in the datapath that it is hard to understand how they all interact. A friend of mine once said that caches serve only two purposes, reducing latency for writes and reducing latency for reads, and they do either only if the data fits in the cache. If the data doesn't fit in the cache, you have a mess on your hands. Understanding performance and knowing the architecture are key to having a cost-effective system for both the short and long term.
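To make the "fits in the cache" point concrete, here is a back-of-the-envelope sketch; the latency and hit-rate figures are invented for illustration, not measurements of any product.

```python
# Back-of-the-envelope effective latency for a single cache tier.
# All numbers below are illustrative assumptions, not measurements.

CACHE_LATENCY_US = 100       # e.g., a flash-backed cache
BACKEND_LATENCY_US = 5000    # e.g., a spinning-disk back end

def effective_latency_us(hit_rate):
    """Weighted average latency for a given cache hit rate (0.0-1.0)."""
    return hit_rate * CACHE_LATENCY_US + (1.0 - hit_rate) * BACKEND_LATENCY_US

for hit_rate in (0.99, 0.95, 0.90, 0.50):
    print(f"hit rate {hit_rate:.0%}: ~{effective_latency_us(hit_rate):,.0f} us per I/O")

# Dropping from a 99% to a 90% hit rate roughly quadruples average latency,
# which is why a working set that no longer fits in cache feels like "a mess."
```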
#8 Understanding Application Workloads
Applications are what matter, because without application requirements no one would need to buy compute or storage. Therefore, it behooves the people designing and administering the system to fully understand the applications that will run on it and the resources they are going to need. Everyone should understand what the applications do to the system and the business objectives for each application. For example, does a specific application need to run within a specific time to meet a business objective? If so, the system must be engineered for what is called peak load. Does the application need to meet its timeline objectives while the system is doing a RAID rebuild, for example, or a controller failover? Defining the expectations for applications and the systems they run on is critical for success.
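As a purely illustrative sketch of engineering to peak (every figure below is an assumption), the arithmetic shows why a nightly job that fits comfortably at healthy throughput can nearly blow its window during a RAID rebuild or controller failover.

```python
# Illustrative sizing check: can a batch job still meet its window when the
# storage system is degraded? All figures are assumptions for the example.

JOB_DATA_TB = 20                 # data the nightly job must read/write
WINDOW_HOURS = 6                 # business deadline for the job
NORMAL_THROUGHPUT_GBS = 2.0      # aggregate GB/s when healthy
DEGRADED_FRACTION = 0.5          # e.g., throughput halved during a RAID rebuild

def hours_to_finish(data_tb, throughput_gbs):
    return (data_tb * 1024) / (throughput_gbs * 3600)

normal = hours_to_finish(JOB_DATA_TB, NORMAL_THROUGHPUT_GBS)
degraded = hours_to_finish(JOB_DATA_TB, NORMAL_THROUGHPUT_GBS * DEGRADED_FRACTION)

print(f"healthy:  {normal:.1f} h (window {WINDOW_HOURS} h)")
print(f"degraded: {degraded:.1f} h (window {WINDOW_HOURS} h)")
# Healthy, the job takes about 2.8 hours; degraded, about 5.7 hours against a
# 6-hour window. Size for peak conditions, not for the average day.
```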
#7 Object Storage
The POSIX stack really has not changed since the late 1980s (yes, that's not a typo) except for the addition of asynchronous I/O in the early 1990s. POSIX has a number of known performance limitations for metadata scaling and scaling in general. Object storage was designed to overcome POSIX limitations and has a far simpler application interface. The problem is that we have about 30 years of software development invested in applications that expect a POSIX interface for I/O, and you are not going to change all of that code overnight. Object storage is in your future, and understanding how and where it can fit into the current environment should be a priority.
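To give a feel for how much simpler the object interface is, the sketch below writes the same payload through the POSIX calls most applications expect and then through an S3-style PUT using boto3. The bucket name is hypothetical, and credentials and region configuration are assumed to be in place.

```python
"""Sketch: the same write expressed as POSIX I/O and as an S3-style object PUT.
The bucket name is hypothetical; boto3 must be configured with credentials."""
import os

import boto3  # pip install boto3

payload = b"application data"

# POSIX-style: a path, open/write/fsync/close, byte-level semantics.
fd = os.open("/tmp/result.dat", os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
os.write(fd, payload)
os.fsync(fd)          # durability is the application's problem
os.close(fd)

# Object-style: a whole object goes in or comes out under a single key.
s3 = boto3.client("s3")
s3.put_object(Bucket="example-results-bucket",   # hypothetical bucket
              Key="results/result.dat",
              Body=payload)
obj = s3.get_object(Bucket="example-results-bucket", Key="results/result.dat")
assert obj["Body"].read() == payload
```

The catch, of course, is that the thirty years of existing code mentioned above was written against the first style, not the second.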
#6 Software Defined
The use of software-defined storage is growing and will continue to grow. My concern goes back to the issue of errors, as much of the software does not include enclosure management and monitoring. Software-defined storage is becoming more widely available, but it is prudent to understand what you are getting, and what you are missing from the stack, compared to what you got from your previous storage vendor. How important is what you are missing to the mission of the organization and to ensuring that you can meet that mission?
#5 Storage Tiers
Storage tiers have been around for decades and used to be called hierarchical storage management (HSM); same function, different name today. The cost of really fast storage (SSDs) in enterprise storage is still five to eight times greater than the cost of spinning disk, but everyone wants SSD performance at disk cost (we all want a Porsche for the price of a Chevy, too). The basic functionality of tiered storage is reasonably easy; the hard part is managing important applications, less important applications, and applications you do not care about at all within a single namespace and still getting the right results out of your tiering software. Managing storage tiers within a workflow is difficult, especially for complex, changing workloads, and determining what is important and what is not is also very difficult. A single misbehaving application can hurt the performance of critical work.
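Purely as an illustration of why "determining what is important" is the hard part, the toy policy below demotes files by last-access age alone; the thresholds, tier names and scan root are assumptions, and a single misclassified hot application would land its files on the slow tier.

```python
# Toy tiering policy: demote by last-access age alone. Thresholds, tier names,
# and the scan root are all assumptions for illustration; real HSM/tiering
# software also needs per-application priorities, which is the hard part.
import os
import time

SCAN_ROOT = "/data"                       # hypothetical namespace to scan
TIERS = [                                 # (max age in days, tier label)
    (7,    "ssd"),
    (90,   "disk"),
    (3650, "archive"),
]

def pick_tier(path, now=None):
    now = now or time.time()
    age_days = (now - os.stat(path).st_atime) / 86400
    for max_age, tier in TIERS:
        if age_days <= max_age:
            return tier
    return TIERS[-1][1]

def plan_moves(root):
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            yield path, pick_tier(path)

if __name__ == "__main__":
    for path, tier in plan_moves(SCAN_ROOT):
        print(f"{tier:8s} {path}")   # a real system would migrate, not print
```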
#4 Integrating Old and New Technology
How you deal with old technology will have a lot to do with how successful you are in the organization over the long haul. The integration costs of mixing old and new technology can be very high, but the costs of doing nothing will be higher. Figuring out how to integrate old and new technology into the same environment is as much art as science, and doing all of this while meeting your business objectives is critical not only for the organization, but for your own success within it.
#3 New Technologies
Picking the right new technologies is difficult, as you are betting on winners and losers. For a recent example, let's say you picked an SSD vendor that provided you with a large controller five years ago. There is a high statistical probability that that vendor has since been consolidated or is no longer in business. The SSD example is similar to other vendor waves that have come and gone; it seems to happen every ten years or so. Anyone remember the large number of new storage controller vendors of the early 2000s, or all the new compute vendors of the early 1990s (Kendall Square, MultiFlow, etc.)? Picking the right vendors for new technology requires a good understanding of both funding and technology.
#2 The Cloud
Not a day goes by without a prediction that some percentage of all computing will move to the cloud by a date in the not-too-distant future. If all of this is true, many of us are going to be out of jobs. Like it or not, many of our organization's business functions are going to move to the cloud, and fighting tooth and nail for each application is not a good long-term strategy for ensuring continued employment. The key is to figure out which common activities can and should move to the cloud, email being a prime example, and which applications should not. You are going to have to develop long-term cost models and lay out the issues to show what makes the right business sense. Just saying no is not the right answer.
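As one hedged example of the kind of cost model to build, the skeleton below compares a fully loaded on-premises monthly cost against a per-GB cloud price plus egress. Every number is a placeholder, not a quote.

```python
# Skeleton of a cloud-vs-on-premises storage cost comparison.
# Every figure is a placeholder assumption; substitute real quotes and
# your own fully loaded costs (power, space, staff, support contracts).

CAPACITY_GB = 500_000                # 500 TB usable

# On-premises: amortized hardware plus operations, per month.
ONPREM_HW_MONTHLY = 12_000.0         # array amortized over its service life
ONPREM_OPS_MONTHLY = 8_000.0         # power, cooling, floor space, admin time

# Cloud: capacity plus data transferred out, per month.
CLOUD_PER_GB_MONTH = 0.02
EGRESS_GB_PER_MONTH = 20_000
CLOUD_EGRESS_PER_GB = 0.09

onprem = ONPREM_HW_MONTHLY + ONPREM_OPS_MONTHLY
cloud = CAPACITY_GB * CLOUD_PER_GB_MONTH + EGRESS_GB_PER_MONTH * CLOUD_EGRESS_PER_GB

print(f"on-premises: ${onprem:>10,.2f} / month")
print(f"cloud:       ${cloud:>10,.2f} / month")
# The point is not which side wins here (the inputs are invented); it is that
# a defensible model beats just saying no.
```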
#1 Security
The quickest way to lose your job is to have a security breach traced to your architecture or your technical decisions. Security needs to be the number one thing you think about, all of the time. Of course, many security breaches will not originate in the storage domain, but the old saying about throwing the baby out with the bathwater may well apply. Keep a documentation trail showing that you have done your best to make the system as secure as you can.
Final Thoughts
This isn't an easy list to stay on top of, but doing so will be important to your continued success in our fast-changing industry. The list is a snapshot of what I see today and is likely good for at least a few years. For sure, it is not a static list, just as ours is not a static industry, but things in our industry do seem to be cyclical. I have long said that there are no new engineering problems, just new engineers solving old problems.