Tuning RAID controllers is not as difficult as some vendors would have you believe; there’s no need for professional services to get the job done.
Many of the tunable parameters revolve around the cache and how it is used, along with the obvious tunables for the RAID LUNs. This article isn’t about tuning specific RAID controllers; for that, you will need to spend some time reading the documentation. Hopefully, though, by reading this you will be able to consider the parameters in the context of the I/O of the whole system. Each vendor has its own nomenclature for variable names and what they mean. As there is no standard set of definitions, I have chosen my own, which you should be able to map to a specific vendor. The areas that need to be considered are LUN creation and RAID level, and cache tuning and configuration.
Figuring out what RAID levels to use has been pretty well covered (see RAID Storage Levels Explained), so we’ll stick to the subject of RAID tunables here. Whether you are configuring a RAID controller card in your PC or a high-end, mission-critical enterprise RAID array, you should have a good understanding of what to consider after reading this article.
We’ll start by considering what type of RAID controller you have. Today they can be broken down into three categories:
- Enterprise Active/Active: This type of controller allows you to write from any host to any LUN without performance degradation. These controllers typically have large mirrored caches (often over 32 GB) and are designed for hot-swap everything and very high uptime. Communication to the controller today is over Fibre Channel, and soon FCoE.
- Midrange Active/Passive: This type of controller has two sides for each LUN: an active side, which is the primary path, and a passive side, which is used for failover. You typically split the LUNs evenly between the two sides, balancing primary and failover paths across the system. Cache can be mirrored in the controller, but these controllers are not as resilient as enterprise controllers. Communication to the controller today is over Fibre Channel, and soon FCoE.
- RAID Host Cards: These are cards that plug into a PCIe slot and connect to the drives via SAS or SATA. These cards do not have processors as powerful as midrange or enterprise controllers, nor do they support as many drives. Failover to another controller is not possible, and your system is only as resilient as your PCIe slot and controller card.
Many RAID vendors think only about their devices and storage. They assume that the host allocates storage sequentially, as if every device were a raw device written from one end to the other. Although this view is changing somewhat, I still run into the bizarre vendor assumption that the whole world uses nothing but raw devices and databases and that files are written one at a time. Block-based file systems don’t allocate data sequentially.
RAID Cache Tuning and Configuration
RAID cache tuning can be broken down into three areas:
- Tuning cache, both read-ahead and write-behind
- Tuning cache block sizes
- Tuning cache for mirroring (important for midrange controllers)
Read-ahead and Write-behind: You might think that read-ahead and write-behind would be tuned the same way, but they behave quite differently.
For read-ahead (reading data before it is requested by fetching sequential blocks from the disk) to work, the data must be read sequentially and must be allocated on sequential block addresses. RAID controllers do not know the topology of the file system or the data; all they know are sequential block addresses, so controller I/O requests are made against sequential block addresses. If your file system allocation is smaller than your RAID stripe size, files are likely to be fragmented within those RAID stripes whenever more than one file is being written at the same time.
If, for example, the file system allocation is 64 KB, the RAID 5 8+1 stripe is 512 KB, and multiple files are being written, most RAID controllers will read the data you requested (64 KB in this case), perhaps another 64 KB, and, if you keep reading sequentially, often the whole stripe. On the other hand, if you read just a single 64 KB block and the rest of the stripe holds data from other files, read-ahead only hurts. Match the RAID stripe to the file system allocation, add some knowledge of how many files are being written at the same time, and you will have a good picture of the impact read-ahead could have on your system and how to tune for it.
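To make that trade-off concrete, here is a minimal back-of-the-envelope sketch, not modeled on any particular controller. The round-robin allocation assumption and the function name `useful_readahead_fraction` are mine, chosen purely for illustration:

```python
# Toy model: how much of a read-ahead stripe belongs to the file being read
# when several files were written concurrently and the file system allocation
# unit is smaller than the RAID stripe. Sizes are in KB and are illustrative.

def useful_readahead_fraction(stripe_kb, fs_alloc_kb, concurrent_writers):
    """Estimate the fraction of a read-ahead stripe that holds the reader's data.

    Assumes allocation units inside a stripe are spread evenly across the
    files that were being written at the same time.
    """
    allocs_per_stripe = max(1, stripe_kb // fs_alloc_kb)
    if concurrent_writers <= 1:
        return 1.0  # a single writer tends to get sequential allocations
    return 1.0 / min(concurrent_writers, allocs_per_stripe)

# Figures from the article: 64 KB allocations inside a 512 KB RAID 5 8+1 stripe.
for writers in (1, 2, 8):
    frac = useful_readahead_fraction(512, 64, writers)
    print(f"{writers} concurrent writer(s): ~{frac:.0%} of each read-ahead stripe is useful")
```

With one writer the whole stripe is useful; with eight concurrent writers only about an eighth of what read-ahead pulls into cache belongs to the file you are actually reading.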
Write-behind (accepting writes into cache and flushing them to disk later) provides significant value if the data being written is aligned to the stripe size of the RAID, as it gives the writer acknowledgement of the write as soon as the data hits the cache. The key here is that the data must be aligned to the RAID stripe, which, depending on the file system, can often be difficult. If it is not aligned, the RAID controller must do a read-modify-write (read the stripe in, modify it with the new data, write the stripe out), which has high overhead and latency. The purpose of RAID cache in this case is to hide the latency of writing to disk by acknowledging the write as soon as the data hits the cache. Tuning for write-behind often involves deciding how much cache space to allocate for writing compared to read-ahead on some controllers, as well as the minimum cache block size that can be read or written.
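The cost of missing that alignment can be sketched with a simple operation count. The geometry below (a RAID 5 8+1 LUN with 64 KB per-disk chunks) and the helper names are assumptions for illustration, not any vendor’s implementation:

```python
# Rough comparison of disk operations for an aligned full-stripe write versus
# an unaligned write that forces a read-modify-write on a RAID 5 8+1 LUN.

DATA_DISKS = 8                      # RAID 5 8+1: eight data disks plus one parity disk
CHUNK_KB = 64                       # per-disk chunk size
STRIPE_KB = DATA_DISKS * CHUNK_KB   # 512 KB of data per stripe

def is_full_stripe_write(offset_kb, length_kb):
    """True if the write starts on a stripe boundary and covers whole stripes."""
    return offset_kb % STRIPE_KB == 0 and length_kb % STRIPE_KB == 0

def disk_ops_per_stripe(offset_kb, length_kb):
    """Count disk I/Os for one stripe's worth of data under a simple model."""
    if is_full_stripe_write(offset_kb, length_kb):
        # Write all data chunks plus newly computed parity: no reads needed.
        return {"reads": 0, "writes": DATA_DISKS + 1}
    # Read-modify-write for a single-chunk update: read old data and old parity,
    # then write new data and new parity (partial stripes fall in between).
    return {"reads": 2, "writes": 2}

print(disk_ops_per_stripe(0, 512))   # aligned full stripe -> {'reads': 0, 'writes': 9}
print(disk_ops_per_stripe(64, 64))   # small unaligned update -> {'reads': 2, 'writes': 2}
```

The aligned case never has to read before writing, which is exactly the latency the write-behind cache is trying to hide in the unaligned case.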
Tuning RAID Cache Block Sizes
The cache block size is the minimum amount of data that can be read into or written out of the cache. For example, the RAID allocation on a disk might be 32 KB, and you might assume that all I/O to and from the disk is 32 KB, but if the cache block size is, say, 4 KB, then the minimum read or write to that device is 4 KB, eight times the traditional 512-byte disk sector. If your file system allocations are large and your write requests are large, a small cache block size likely reduces the performance of the RAID: most RAID controllers I have seen slow down with smaller cache blocks because they do not have the CPU power to manage all of the blocks. This might become less true as the next crop of controllers arrives with much faster CPUs, but small cache blocks are necessary when data is not aligned to the per-disk allocation within the RAID.
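A quick calculation shows why the bookkeeping grows so quickly as the cache block shrinks. The cache size and block sizes below are made-up examples, not figures from any product:

```python
# Why small cache blocks tax the controller CPU: the number of cache blocks
# the firmware has to track grows as the block size shrinks.

def cache_blocks_to_track(cache_gib, cache_block_kb):
    """Number of cache blocks the controller must manage for a given cache size."""
    return (cache_gib * 1024 * 1024) // cache_block_kb

for block_kb in (4, 16, 64, 256):
    blocks = cache_blocks_to_track(32, block_kb)   # assume a 32 GiB cache
    print(f"{block_kb:>3} KB cache blocks -> {blocks:,} blocks to manage")
```

Going from 256 KB blocks to 4 KB blocks multiplies the number of blocks the controller must track by 64, which is where the CPU overhead comes from.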
Take the case where you write in small requests, read in large requests, and the file system allocation is as large as the RAID stripe. In that case the file system is unlikely to be fragmented badly enough to break up sequential allocation even with multiple writers, so read-ahead will likely help. Read-ahead will also help if the writes are bigger than the reads, as the RAID controller will see the smaller reads as sequential. So when tuning for reads, you need to compare the read request size with the write request size and determine how many files are being written at the same time. If the answer is one file at a time, data will likely be allocated sequentially (unless the file system is fragmented) and read-ahead will provide great benefit. On the other hand, if multiple files are being written and both the write size and the file system allocation are less than the stripe size, read-ahead will provide little to no value. It comes down to this: when multiple files are being written, read-ahead works only if writes and allocations are equal to or greater than the RAID stripe size.
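That rule of thumb can be written down as a simple check. This is only an encoding of the qualitative guidance above, not a controller algorithm, and the function name and inputs are assumptions for illustration:

```python
# Rule-of-thumb check: does read-ahead look worthwhile for this workload?
# Thresholds follow the article's guidance; they are not a vendor heuristic.

def readahead_likely_helps(write_kb, fs_alloc_kb, stripe_kb, concurrent_writers):
    if concurrent_writers <= 1:
        # A single writer usually produces sequential allocations (absent
        # pre-existing fragmentation), so read-ahead should help.
        return True
    # With multiple writers, both the write size and the file system allocation
    # need to be at least a full stripe for stripes to stay single-file.
    return write_kb >= stripe_kb and fs_alloc_kb >= stripe_kb

print(readahead_likely_helps(write_kb=64,  fs_alloc_kb=64,  stripe_kb=512, concurrent_writers=8))   # False
print(readahead_likely_helps(write_kb=512, fs_alloc_kb=512, stripe_kb=512, concurrent_writers=8))   # True
```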
Tuning Cache for Mirroring
Write cache mirroring is a common feature in many midrange RAID products, and all writes are mirrored in enterprise controllers. The controller takes the I/O request and copies it to the cache on the other half of the controller in case the half being written to fails. Some vendors have techniques for bypassing the write cache mirroring requirement when the data is aligned on a full stripe, but in a general-purpose environment with write cache mirroring, each write goes into cache and is then copied to the other cache before the acknowledgement is returned for the I/O request. Write cache mirroring therefore generally slows performance because of the latency and bandwidth cost of copying to the other cache, and because each cache must mirror the other, you often lose half of the cache space to mirroring.
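The capacity side of that trade-off is easy to see with a small sketch. The cache size and the write/read split below are illustrative values, and the parameter names are mine rather than any vendor’s tunables:

```python
# Capacity cost of write cache mirroring: each controller half mirrors the
# other's dirty write data, so usable write cache is roughly halved.

def usable_write_cache_gib(total_cache_gib, write_fraction, mirrored=True):
    """Cache actually available for new writes.

    write_fraction: share of cache reserved for writes (vs. read-ahead), a
    tunable many midrange controllers expose under their own names.
    """
    write_cache = total_cache_gib * write_fraction
    return write_cache / 2 if mirrored else write_cache

print(usable_write_cache_gib(16, 0.5, mirrored=False))  # 8.0 GiB for writes
print(usable_write_cache_gib(16, 0.5, mirrored=True))   # 4.0 GiB once mirrored
```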
If the vendor has tunable parameters for read and write cache, tuning them based on workload and reliability requirements is worth considering. The question I often hear is whether users should use write cache mirroring or not. The answer depends on how much data reliability you want. Let’s say you are writing a file and the data lands in cache on a system without write cache mirroring. At the same time, the whole back end of the controller (from cache to disk) fails. At this point, your application has been told that the write was successful, but it never got to disk. Obviously, the chances of this happening are slim, but it is possible, and I have seen it happen. If you did another write to the same file, you might get an I/O error, as most RAID controllers return an error when they realize they cannot write from cache to disk, or the controller might fail over to the side that is still working and your write would complete normally, but the file is missing a write and the application doesn’t know it. Missing a write in a file is not a good thing, which is why write cache mirroring is on by default. Tuning for write cache mirroring comes down to deciding how much cache space to reserve for writes, and write cache mirroring should stay on, as silent data corruption is something you just do not want to deal with no matter how low the odds are. Finding the bad or missing data after a controller failure is next to impossible.
Tuning RAID controllers isn’t that difficult if you understand a bit about the application load, that is, what the applications will do with the RAID. Read-ahead is often not useful if multiple files are being written and the file system allocation is small; the best example of this bad situation is NTFS on Windows. For file systems with allocations as large as or larger than the RAID stripe, read-ahead will have a significantly positive impact.
Henry Newman, CEO and CTO of Instrumental Inc. and a regular Enterprise Storage Forum contributor, is an industry consultant with 29 years of experience in high-performance computing and storage.