When Tape and RAID Don't Get Along
Not long ago, I was working at a site where the customer was complaining about tape performance and tape archive reliability. The problems showed up in a number of ways: backups were running very slowly, with many errors occurring, and older backup tapes were error prone.
For a site that needed high reliability, fast restoration and long-term access to data, this was a bad situation. From a business continuity perspective, the status quo was not going to fly with management or the users.
Clearly I had to figure out the root of the problem. If reading or writing to tape was slow, then what was the cause?
Analyzing Tape Performance
The first thing I did was use the sar command to look at tape performance and understand how bad things really were. From the transfer rate sar reported to the tape drive, it was clear that the drive was running very poorly. When doing this type of work, you should know the expected performance of the tape drive for both compressed and uncompressed data. At the bottom of this article, you'll find a table displaying some common tape drives and their performance, according to vendor Web sites. Your mileage may vary, of course.
Compression varies with the type of data, but enterprise drives from IBM and Sun/STK generally will provide better compression than LTO drives, given the compression chipset being used.
Operating System and Application
Since I knew the tape drive type and did a quick estimation on compression using gzip and a random sampling of files, I knew that the tape drive data rate was running at less than 30% of what it should be. Since tape drives only write the data they get, it was time to check the connection to the tape drive, operating system settings, and RAID configuration and settings.
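The quick compression estimate described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used at the site: it uses zlib (the DEFLATE algorithm behind gzip) as a stand-in for the drive's compression hardware, which will behave differently, and the function names, sample size, and chunk size are assumptions made for the example.

```python
import random
import zlib


def estimate_compression_ratio(paths, sample_size=100, chunk_bytes=256 * 1024, seed=0):
    """Return raw bytes / compressed bytes over a random sample of files.

    Each chunk is compressed on its own, loosely imitating a drive that
    compresses one tape block at a time. zlib is an assumption here; the
    drive's own algorithm will give a different ratio.
    """
    rng = random.Random(seed)
    paths = list(paths)
    sample = rng.sample(paths, min(sample_size, len(paths)))
    raw = packed = 0
    for path in sample:
        with open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_bytes)
                if not chunk:
                    break
                raw += len(chunk)
                packed += len(zlib.compress(chunk, 6))
    return raw / packed if packed else 1.0


def fraction_of_expected(observed_mb_s, native_mb_s, ratio):
    """Observed rate as a fraction of native rate times the estimated ratio."""
    return observed_mb_s / (native_mb_s * ratio)
```

Feeding the observed rate from sar, the drive's native rate from the vendor, and the estimated ratio into `fraction_of_expected` gives a rough figure comparable to the "less than 30%" above. Vendors may cap the compressed rate, so treat the result as an upper bound on what to expect.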
I looked at the tape HBA settings and everything was set correctly: HBA was set for tape emulation and a large command queue, and no errors were being received on the HBA.
The next step was to look at the operating system configuration and information for the application writing the tapes. Here I found a couple of problems.
The operating system was not tuned to allow requests over 128 KB to be read or written. Since the tape block size is 256 KB, this was causing multiple I/O requests for a single tape block.
The application writing/reading the tape drive only had four readahead/writebehind buffers. Given the latency from the RAID and to the tape drive, this could be a serious problem, but bigger problems lurked as I dug deeper.
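Both problems come down to simple arithmetic, sketched here in Python. The function names are made up for the illustration, and the 100 MB/sec drive rate in the usage lines is a hypothetical figure, not a measurement from this site.

```python
def requests_per_tape_block(tape_block_kb, max_io_kb):
    """Number of kernel I/O requests needed to move one tape block."""
    return -(-tape_block_kb // max_io_kb)  # ceiling division


def buffered_ms(buffers, tape_block_kb, drive_mb_per_s):
    """Milliseconds the drive can keep streaming from full buffers alone."""
    buffered_mb = buffers * tape_block_kb / 1024
    return buffered_mb / drive_mb_per_s * 1000


if __name__ == "__main__":
    # 256 KB tape blocks with a 128 KB kernel limit: two requests per block.
    print(requests_per_tape_block(256, 128))
    # Four 256 KB buffers hold 1 MB; at a hypothetical 100 MB/sec that is
    # only about 10 ms of data before the drive runs dry.
    print(buffered_ms(4, 256, 100))
```

If the RAID takes longer than that window to deliver the next block, the drive stops streaming, which is why a small buffer count matters more as drive rates go up.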
When I looked at the RAID, the pieces of the puzzle fell into place. I have often recommended against doing what this customer did, but until then I had never had a reason to say it is a really bad idea in all environments. The customer had configured a RAID-5 5+1 with 64 KB segments. That means a full stripe of data is 320 KB, while a read or write from the tape device is 256 KB, so data was moving to and from the tape at a block size that did not match the RAID stripe, resulting in significant work for the kernel. Since the data rate of the tape drive with compression was nearly the rate of the 2 Gb RAID, the problem was clear. This non-power-of-two LUN layout was a significant mismatch and the main cause of the performance problem. Take the following example:
|Disk 1||Disk 2||Disk 3||Disk 4||Disk 5||Disk 6||Stripe|
|Parity||64 KB (tape block 1)||64 KB (tape block 1)||64 KB (tape block 1)||64 KB (tape block 1)||64 KB (tape block 2)||320 KB Stripe 1|
|64 KB (tape block 2)||Parity||64 KB (tape block 2)||64 KB (tape block 2)||64 KB (tape block 3)||64 KB (tape block 3)||320 KB Stripe 2|
|64 KB (tape block 3)||64 KB (tape block 3)||Parity||64 KB (tape block 4, partial)||64 KB (tape block 4, partial)||64 KB (tape block 4, partial)||320 KB Stripe 3|
Clearly you are not reading or writing full stripes of data from the RAID device, and after the first read or write, you will have to do a head seek for every I/O. Every fifth block read or written will not require a head seek, but this is not going to provide good performance if almost every I/O requires reading two stripes. Increasing the per-disk segment size to a larger value, say 256 KB, might improve performance for this type of configuration, but it is still not optimal.
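The mismatch can be checked with a short Python sketch that maps each 256 KB tape block onto the stripes it touches. It is a simplified model: it assumes tape blocks are laid out back to back starting at the beginning of the LUN, it counts stripes rather than head seeks, and the function names are invented for the example.

```python
def stripes_touched(block_index, block_kb, stripe_kb):
    """Return the indices of the RAID stripes that one tape block overlaps."""
    first_kb = block_index * block_kb
    last_kb = first_kb + block_kb - 1
    return list(range(first_kb // stripe_kb, last_kb // stripe_kb + 1))


def straddling_fraction(block_kb, stripe_kb, n_blocks):
    """Fraction of the first n_blocks tape blocks that span two or more stripes."""
    spanning = sum(
        1 for i in range(n_blocks) if len(stripes_touched(i, block_kb, stripe_kb)) > 1
    )
    return spanning / n_blocks


if __name__ == "__main__":
    # 5+1 RAID-5 with 64 KB segments: 320 KB stripes, 256 KB tape blocks.
    for i in range(5):
        print(i, stripes_touched(i, 256, 320))
    # 4+1 RAID-5 with 64 KB segments: 256 KB stripes, every block is one stripe.
    print(straddling_fraction(256, 256, 20))
```

Under this model the pattern repeats every five tape blocks (1280 KB, or four stripes), and three of those five blocks span two stripes. With a 4+1 layout at the same segment size, no block does.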
One of the reasons this will become an issue in the future is that tape block sizes will increase over time, so even if you reduce the impact of the problem today by using larger per-disk allocations, the fix is unlikely to hold. Using the same 5+1 example with, say, a 256 KB per-disk allocation creates a stripe size of 1280 KB. If tape block size moves to the range of 1 MB or 2 MB or even greater, then you'll have the same problem all over again.
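A rough way to see how a layout holds up as tape blocks grow is to test whether the tape block and the stripe divide evenly into one another. The Python sketch below uses that divisibility test as a stand-in for "aligned"; it is a simplification that ignores where the data starts on the LUN, and `is_aligned` is a name made up for the example.

```python
def is_aligned(tape_block_kb, data_disks, segment_kb):
    """True if tape blocks and full stripes divide evenly into one another."""
    stripe_kb = data_disks * segment_kb
    return tape_block_kb % stripe_kb == 0 or stripe_kb % tape_block_kb == 0


if __name__ == "__main__":
    for tape_block_kb in (256, 1024, 2048):
        print(
            tape_block_kb,
            "5+1:", is_aligned(tape_block_kb, 5, 256),
            "4+1:", is_aligned(tape_block_kb, 4, 256),
        )
```

With 256 KB segments, the 5+1 layout lines up with 256 KB tape blocks but not with 1 MB or 2 MB blocks, while the 4+1 layout lines up with all three.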
Power of Two a Problem
Why do vendors and owners set up RAID devices to have non-power of two allocations? (By power of two, I mean data drives; e.g., 8+1 RAID-5 is a power of two.) From what I can tell, it is done for three reasons:
- From the vendor side, a number of vendors do not support power-of-two RAID allocations for their RAID-5 devices. This seems to me to ignore many of the applications that are out there. Database index files are often powers of two, database table sizes are often powers of two, reads for Web servers are very often powers of two, and C library buffer allocations (fwrite/fread) are powers of two, as are the requests of many other applications. I don't get why vendors would not do something so basic for RAID performance. The response I get from them is "we do readahead," which is correct but assumes that the file system allocated the data sequentially, which often is not true.
- RAID owners are concerned with write reconstruction time and often set up the configuration around the time it takes to reconstruct a LUN.
- RAID owners often set up devices based on drive count and wasted parity drives. If you buy 10.2 TB in 300 GB drives (34 drives) and end up with 4+1 RAID-5 LUNs, you would have six LUNs for a total of 7.2 TB of data space. Using the same example and setting up as 9+1 RAID-5 would give you 8.1 TB of data space, while a RAID-5 16+1 would give you 9.6 TB (but no hot spares). This is one of the main reasons I have seen for odd combinations.
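The capacity arithmetic in the last item can be reproduced with a small Python sketch. It assumes decimal units (1 TB = 1000 GB), which is what the figures above imply, and that any drives left over after forming whole LUNs are kept as hot spares; `lun_layout` is a name invented for the example.

```python
def lun_layout(total_drives, data_disks, parity_disks=1, drive_gb=300):
    """Return (LUN count, leftover drives, usable data space in TB)."""
    per_lun = data_disks + parity_disks
    luns = total_drives // per_lun
    spares = total_drives - luns * per_lun
    data_tb = luns * data_disks * drive_gb / 1000  # decimal TB, an assumption
    return luns, spares, data_tb


if __name__ == "__main__":
    total_drives = 10200 // 300  # 10.2 TB bought as 300 GB drives = 34 drives
    for data_disks in (4, 9, 16):
        luns, spares, data_tb = lun_layout(total_drives, data_disks)
        print(f"{data_disks}+1: {luns} LUNs, {spares} spare drives, {data_tb} TB")
```

The numbers show the pull toward wide or odd layouts: going from 4+1 to 16+1 adds 2.4 TB of data space from the same 34 drives, at the cost of the spares.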
Powers of two are important for tape performance for RAID configurations, but powers of two are also important for many other application types. In the HPC world where I usually work, many applications require powers of two for their allocation in memory and in many cases for their allocation of CPU counts, and the same is true for I/O to storage for algorithms such as FFTs.
That people use non-powers of two, or even worse, prime numbers, for RAIDs is a sign that there needs to be a great deal more education on data movement systems issues. The problem is that everyone looks at their own hardware and software design and development in a vacuum. This is true not just for hardware and software developers, but for the system architects and designers who configure systems. All in all, there are worse things in the world than a poorly configured RAID and tape system, but if your data is really important, then you need to think about the architecture globally.
|Drive||Native Performance in MB/sec||Compressed Performance in MB/sec|
|Quantum DLT 600A||36||72|
|Sun/STK T10000||120||360 (claims this performance with future 4 Gb support)|
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 25 years experience in high-performance computing and storage.