Storage planning for virtualized systems requires careful thinking about the issues that will affect users. One major tradeoff of virtualizing systems is that you are trading local bandwidth for network bandwidth. In a nonvirtualized environment, each user has local storage for at least booting his system and very often for his data as well. On a virtualized system, the user boots up over the network and accesses applications and data over the network. Here, network performance and, more importantly, network latency are critical. We all know that on a standard enterprise SATA drive the average rotational latency is around 4.16 ms and the average seek time is between 8.5 ms (read) and 9.5 ms (write), depending on the drive manufacturer. This is not going to change, so latency must be considered.
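To make the latency budget concrete, here is a minimal sketch of the time for one random I/O on a spinning drive, with and without a network hop. It assumes a 7,200 RPM drive (which gives roughly the 4.16 ms average rotational latency quoted above) and a purely hypothetical 0.5 ms network round trip; the function names are illustrative, not from any tool.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: half of one revolution, in milliseconds."""
    return 0.5 * 60_000.0 / rpm


def random_io_ms(seek_ms: float, rpm: float, network_rtt_ms: float = 0.0) -> float:
    """Rough time for one random I/O: seek + rotational latency + network hop."""
    return seek_ms + rotational_latency_ms(rpm) + network_rtt_ms


# Assumed values: 8.5 ms read seek, 7,200 RPM, 0.5 ms network round trip.
local_read_ms = random_io_ms(8.5, 7200)
remote_read_ms = random_io_ms(8.5, 7200, network_rtt_ms=0.5)

print(f"local read:  {local_read_ms:.2f} ms")
print(f"remote read: {remote_read_ms:.2f} ms")
```

The sketch ignores queueing, controller cache and transfer time, so it is a floor on latency rather than a prediction.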
One key issue is the latency on your network and to the storage that the virtualized user will be accessing as compared to the local storage. However, that is far from the only issue. These four issues are paramount and must be considered as well:
- Storage contention
- Storage bandwidth
- File system data and metadata fragmentation
- File system free space
1. Storage Contention
If you are the only user on the system, you have full, uncontested access to the disk drive. You might be running applications in the background, but most likely you and the application you are running in the foreground are consuming most of the disk drive's capability. You have 8 MB, maybe 16 MB or even 64 MB, of disk cache, and the disk you are using can likely support 100 random IOPS. Not bad.
In contrast, if you are on a virtualized system, you are very likely contending with many other users. Your application does not have a dedicated disk drive, and with all the other users running, the IOPS available to you might be less than the 100 you had. Of course, most virtualized systems run on storage controllers, often with large caches, which can make a big difference. Contention can be mitigated, but it is difficult to fully determine the contention without a good understanding of the cache usage.
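A back-of-the-envelope sketch of that contention follows. The drive count, user count and cache hit rate are assumptions chosen for illustration, and the model is deliberately crude: it splits backend IOPS evenly among users and treats cache hits as free, ignoring RAID write penalties and bursty access.

```python
def per_user_disk_iops(drives: int, iops_per_drive: float, users: int) -> float:
    """Backend random IOPS available to each user if shared evenly."""
    return drives * iops_per_drive / users


def effective_iops(disk_iops: float, cache_hit_rate: float) -> float:
    """IOPS a user sees when only cache misses have to go to disk."""
    if not 0.0 <= cache_hit_rate < 1.0:
        raise ValueError("cache_hit_rate must be in [0, 1)")
    return disk_iops / (1.0 - cache_hit_rate)


# Assumed: 24 shared drives at 100 random IOPS each, 200 active users.
disk_share = per_user_disk_iops(drives=24, iops_per_drive=100.0, users=200)
with_cache = effective_iops(disk_share, cache_hit_rate=0.5)

print(f"disk IOPS per user: {disk_share:.0f}")
print(f"effective IOPS per user at 50% cache hits: {with_cache:.0f}")
```

Even with half of all requests served from cache, each user in this example sees far fewer than the 100 IOPS of a dedicated local drive, which is why the cache hit rate matters so much to the analysis.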
2. Storage Bandwidth
The bandwidth to storage is also something to consider when planning for virtualization. The average disk performance for a Seagate Enterprise 2.5-inch 10K drive is about 130 MB/sec, and for a Seagate 3 TB 3.5-inch enterprise SATA/SAS drive it is about 112 MB/sec. For each local disk you are getting more than 100 MB/sec if you are streaming data. Of course, file systems do not necessarily allocate data sequentially and stream it, and very few applications that run locally must sustain more than 100 MB/sec, but I think it is important to understand what users need as well as the total required bandwidth across the systems to be virtualized. Having a RAID controller with, for example, 5 GB/sec of bandwidth is not going to work if 200 users need 200 different datasets simultaneously and each wants 30 MB/sec (200 * 30 = 6,000 MB/sec, and the controller does 5,120 MB/sec). Of course, this is not seriously oversubscribed, but if everyone is trying to stream HD video you are going to get a large number of complaints. It is pretty easy to oversubscribe the bandwidth of a storage system in a virtual environment.
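The oversubscription arithmetic in that example can be sketched in a few lines. The numbers are the ones from the text (200 users, 30 MB/sec each, a 5 GB/sec controller); the function names are illustrative only.

```python
def oversubscription_ratio(users: int, mb_per_sec_per_user: float,
                           controller_mb_per_sec: float) -> float:
    """Demand divided by capacity; anything above 1.0 is oversubscribed."""
    return users * mb_per_sec_per_user / controller_mb_per_sec


def max_users(mb_per_sec_per_user: float, controller_mb_per_sec: float) -> int:
    """Largest number of simultaneous streams the controller can sustain."""
    return int(controller_mb_per_sec // mb_per_sec_per_user)


controller = 5 * 1024.0  # 5 GB/sec expressed in MB/sec
ratio = oversubscription_ratio(200, 30.0, controller)
supported = max_users(30.0, controller)

print(f"oversubscription ratio: {ratio:.2f}")
print(f"users supported at 30 MB/sec: {supported}")
```

This assumes the controller actually delivers its rated bandwidth under 200 concurrent streams, which is optimistic once the access pattern at the drives stops being sequential.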
3. File System Data and Metadata Fragmentation
I am not a big fan of the Windows NTFS file system given the performance I have seen, and given the data and metadata allocation algorithms and the lack of support for high-speed streaming I/O. It is very likely that local disk performance is going to be nowhere near the theoretical performance of the drive, given that the data will not likely be allocated sequentially. Still, the local disk is not likely to have significant contention from many applications accessing the same drive. That will likely not be the case in a large RAID environment where data is allocated across the storage by the volume manager, file system and the RAID controller. More than likely, the data is not sequentially allocated, and there is far more contention at the drive level. That means the number of seeks and the latency will likely be much higher than when using the local drive. That does not even include the network latency, which should also be considered.
4. File System Free Space
Finding free space for new files in a virtual environment is often more CPU-intensive and takes longer than on a local file system. On a local file system, you can just allocate the space as you own the whole drive. That is not the case on a virtual system. You often must traverse many software layers over a network to get new space. Clearly, this is not a problem if the user is using the space that has been allocated to her, but it is a problem if users are regularly exceeding their allocations. This is not a major problem, but certainly something to consider.
I have been thinking about virtualization for a long time. We all know the reported benefits of virtualized environments where you save money on hardware, software, people resources, power and cooling, and everything else. I think it is important to go into virtualization with your eyes wide open in terms of the storage requirements, both performance and spatial. With hundreds of users each having their own personal disk drive, often you have many tens of GB/sec of bandwidth and tens of thousands of IOPS. Oftentimes, when people centralize storage for virtual environments all they look at is the space needed, and this becomes a big problem quickly. Many vendors that sell storage for virtualized environments now have large caches, both DDR and flash, to reduce the amount of data actually going to disk. Although I did not discuss caching in detail, it helps reduce many of the problems discussed above, but it does not eliminate them. If hundreds of users open different files at the same time, and those files have not been used recently, cache is not going to help you, as the data will not be in the cache. Hence, you are limited by the backend bandwidth of the storage when supporting those user requests.
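As a rough illustration of how much aggregate performance those personal disk drives represent, the sketch below totals bandwidth and IOPS for an assumed 500 users, using the per-drive figures mentioned earlier (about 112 MB/sec and 100 random IOPS), and compares the result with the 5 GB/sec controller example. The user count is an assumption for illustration.

```python
def aggregate_local(users: int, mb_per_sec_per_drive: float = 112.0,
                    iops_per_drive: float = 100.0) -> tuple[float, float]:
    """Total streaming bandwidth (GB/sec) and random IOPS of one drive per user."""
    total_gb_per_sec = users * mb_per_sec_per_drive / 1024.0
    total_iops = users * iops_per_drive
    return total_gb_per_sec, total_iops


local_gb_per_sec, local_iops = aggregate_local(500)

# With a cold cache, centralized storage can deliver only its backend bandwidth.
controller_gb_per_sec = 5.0
shortfall = local_gb_per_sec / controller_gb_per_sec

print(f"aggregate local bandwidth: {local_gb_per_sec:.1f} GB/sec")
print(f"aggregate local IOPS: {local_iops:.0f}")
print(f"local aggregate is {shortfall:.1f}x the controller bandwidth")
```

Users rarely drive their local disks flat out at the same moment, so this is a peak figure rather than a sizing target, but it shows why planning on capacity alone understates what is being replaced.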
There are some critical things you must understand if you are going to virtualize your environment. There is no getting around the fact that virtualization will impact your storage system.
There must be a detailed analysis of the user applications and their performance requirements. This is the type of information needed for users doing heavy I/O:
- How much data is read or written?
- What is the request size for reading or writing?
- Are the requests random or sequential?
- Is the application intolerant to latency?
- How many files do users require?
- There must also be some understanding of how much storage is needed and whether users will require more. Adding storage requires the user to allocate more space, and the implications of that allocation must be understood as well.
- A reasonable understanding of data and metadata fragmentation should be part of your analysis.
- Do users add and remove many files?
- Do users rewrite files or parts of files?
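One way to keep the answers to these questions in a form that can be summed across users is a small per-application record. The sketch below is only an illustration: the class, field names and sample values are hypothetical, and the IOPS estimate simply divides bandwidth by request size.

```python
from dataclasses import dataclass


@dataclass
class WorkloadProfile:
    """Answers to the analysis questions for one application (hypothetical fields)."""
    name: str
    read_mb_per_sec: float       # how much data is read
    write_mb_per_sec: float      # how much data is written
    request_size_kb: float       # typical request size
    random_fraction: float       # 0.0 = fully sequential, 1.0 = fully random
    latency_sensitive: bool      # is the application intolerant of latency?
    file_count: int              # how many files the user requires
    files_churned_per_day: int   # files added or removed per day
    rewrites_in_place: bool      # are files or parts of files rewritten?

    def bandwidth_mb_per_sec(self) -> float:
        return self.read_mb_per_sec + self.write_mb_per_sec

    def iops(self) -> float:
        return self.bandwidth_mb_per_sec() * 1024.0 / self.request_size_kb


def totals(profiles: list[WorkloadProfile]) -> tuple[float, float]:
    """Sum required bandwidth (MB/sec) and IOPS across all profiles."""
    return (sum(p.bandwidth_mb_per_sec() for p in profiles),
            sum(p.iops() for p in profiles))


# Sample values are made up for illustration.
profiles = [
    WorkloadProfile("video", 30.0, 0.0, 1024.0, 0.0, True, 50, 2, False),
    WorkloadProfile("office", 1.5, 0.5, 8.0, 0.9, False, 20000, 300, True),
]
total_bw, total_iops = totals(profiles)
print(f"required: {total_bw:.1f} MB/sec, {total_iops:.0f} IOPS")
```

Summing averages like this understates peaks, so measured peak values from the systems to be virtualized would be the better input where they are available.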
The problem is that getting all of this information is difficult. Deploying tools on each of the systems to be virtualized, and then getting all of the data and making sense of it, is very difficult. This is likely why the type of analysis I described is not done very often or is done only when there is a problem. There has been a long history of development of analysis tools and a long history of failure of those tools in the market. Everyone seems to think the tools are too expensive, and that the least expensive way to solve any performance problem is to just buy more hardware. This defeats the purpose of virtualization. The reason you virtualize is to reduce your hardware footprint by right-sizing.
I still think virtualization is a good thing and provides significant ROI for most environments. What I do not understand is why people accept the cost of extra hardware and infrastructure in lieu of buying some software and human resources to do performance analysis on the system and right-size the hardware. I have always believed that not doing performance analysis as part of a capacity planning effort is penny-wise and pound-foolish. Of course, hardware vendors love you for not doing it.
Henry Newman is CEO and CTO of Instrumental Inc. and has worked in HPC and large storage environments for 29 years. The outspoken Mr. Newman initially went to school to become a diplomat, but was firmly told during his first year that he might be better suited for a career that didn’t require diplomatic skills. Diplomacy’s loss was HPC’s gain.