I have been thinking about virtualization for a long time. We all know the reported benefits of virtualized environments: you save money on hardware, software, staff, power and cooling, and everything else. I think it is important to go into virtualization with your eyes wide open in terms of the storage requirements, both performance and capacity. With hundreds of users each having their own personal disk drive, you often have many tens of GB/sec of aggregate bandwidth and tens of thousands of IOPS. Oftentimes, when people centralize storage for virtual environments, all they look at is the space needed, and this becomes a big problem quickly. Many vendors that sell storage for virtualized environments now include large caches, both DDR memory and flash, to reduce the amount of data actually going to disk. Although I did not discuss caching earlier, it helps reduce many of the problems discussed above, but it does not eliminate them. If hundreds of users open different files at the same time, and those files have not been used recently, cache is not going to help you, because the data will not be in the cache. At that point, you are limited by the backend bandwidth of the storage available to support those user requests.
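To make the sizing argument concrete, here is a minimal back-of-the-envelope sketch. The function names and every number in it (300 users, 100 MB/sec and 100 IOPS per desktop drive, the cache hit ratios) are hypothetical values chosen for illustration, not measurements. It adds up what the individual drives could deliver and estimates how many requests still reach the disks behind a cache.

```python
def aggregate_demand(users, mb_per_sec_per_drive, iops_per_drive):
    """Total bandwidth (GB/sec) and IOPS available when each user has a local drive."""
    total_gb_per_sec = users * mb_per_sec_per_drive / 1000.0
    total_iops = users * iops_per_drive
    return total_gb_per_sec, total_iops


def backend_load(frontend_iops, cache_hit_ratio):
    """IOPS that still reach the disks after the cache absorbs its hits."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return frontend_iops * (1.0 - cache_hit_ratio)


# Hypothetical numbers, for illustration only.
gb_s, iops = aggregate_demand(users=300, mb_per_sec_per_drive=100, iops_per_drive=100)
print(f"aggregate: {gb_s:.0f} GB/sec, {iops} IOPS")

# Warm cache: most requests are absorbed before they reach disk.
print(f"backend IOPS at 90% hits: {backend_load(iops, 0.9):.0f}")

# Cold cache (everyone opens files not used recently): nothing is absorbed.
print(f"backend IOPS at 0% hits: {backend_load(iops, 0.0):.0f}")
```

The cold-cache case is the one to size for: with a hit ratio near zero, the backend has to carry the full front-end load on its own.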
There is no getting around the fact that virtualization will impact your storage system. If you are going to virtualize your environment, there are some critical things you must understand about your workload:
- How much data is read or written?
- What is the request size for reading or writing?
- Are the requests random or sequential?
- Is the application intolerant to latency?
- How many files do users require?
- How much storage is needed today, and will users require more? Adding storage means allocating more space, and you need to understand the implications of that allocation.
- A reasonable understanding of data and metadata fragmentation should be part of your analysis.
- Do users add and remove many files?
- Do users rewrite files or parts of files?
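Several of these questions can be answered from an I/O trace once you have one. The following is a minimal sketch, assuming a hypothetical trace format of (file, operation, offset, size) records; the `IORequest` type and `summarize` function are invented for illustration and do not correspond to any particular tracing tool. It reports bytes read and written, average request size, and the fraction of requests that are sequential.

```python
from collections import namedtuple

# Hypothetical trace record: op is "R" or "W"; offset and size are in bytes.
IORequest = namedtuple("IORequest", ["file", "op", "offset", "size"])


def summarize(trace):
    """Summarize a list of IORequest records into basic workload metrics."""
    bytes_by_op = {"R": 0, "W": 0}
    next_offset = {}  # file -> offset at which a sequential request would start
    sequential = 0
    for req in trace:
        bytes_by_op[req.op] += req.size
        # A request is sequential if it starts where the previous request
        # to the same file ended. The first request to a file never counts.
        if next_offset.get(req.file) == req.offset:
            sequential += 1
        next_offset[req.file] = req.offset + req.size
    count = len(trace)
    total = bytes_by_op["R"] + bytes_by_op["W"]
    return {
        "bytes_read": bytes_by_op["R"],
        "bytes_written": bytes_by_op["W"],
        "avg_request_bytes": total / count if count else 0.0,
        "sequential_fraction": sequential / count if count else 0.0,
        "files_touched": len(next_offset),
    }


# Made-up sample trace, for illustration only.
trace = [
    IORequest("a.dat", "R", 0, 4096),
    IORequest("a.dat", "R", 4096, 4096),
    IORequest("b.dat", "W", 65536, 8192),
    IORequest("a.dat", "R", 1048576, 4096),
]
print(summarize(trace))
```

A real analysis would also need latency sensitivity, file creation and deletion rates, and rewrite behavior, none of which this simple record format captures.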
The problem is that getting all of this information is difficult. Deploying tools on each of the systems to be virtualized, and then collecting all of the data and making sense of it, is very difficult. This is likely why the type of analysis I described is not done very often, or is done only when there is a problem. There has been a long history of development of analysis tools and a long history of failure of those tools in the market. Everyone seems to think the tools are too expensive, and that the least expensive way to solve any performance problem is to just buy more hardware. This defeats the purpose of virtualization. The reason you virtualize is to reduce your hardware footprint by right-sizing.
I still think virtualization is a good thing and provides significant ROI for most environments. What I do not understand is why people accept the cost of extra hardware and infrastructure instead of paying for some software and human resources to do performance analysis and right-size the hardware. I have always believed that not doing performance analysis as part of a capacity planning effort is penny-wise and pound-foolish. Of course, hardware vendors love you for not doing it.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn't require diplomatic skills. Diplomacy's loss was HPC's gain.