Data de-duplication is one of the hottest technologies in storage these days, and users and vendors alike are climbing on the bandwagon. There are vendors building hardware products, others building software products, and some doing both.
As usual, I am not going to compare products or different vendor technologies, but I am going to look at an important issue you need to ask your vendor about if you're considering purchasing data de-duplication hardware or software, and that is data corruption. You might wonder what de-duplication has to do with data corruption, and I'll get to that in a minute. But it's important to note that I'm writing this article from a generic hardware and software point of view. Some vendors' products may or may not address all or part of the problems I will discuss in this article. It's up to you to understand what you are buying and to ask the vendors the right questions. Caveat emptor.
A Trip Down the Data Path
Some of you might have read an article I wrote on a data corruption experience I had (see When Bits Go Bad). In the example I gave, a comparison of just a few bits showed that the ASCII characters had changed dramatically; in fact, most of the bytes went bad.
The point of the article was that bits occasionally go bad, sometimes sooner rather than later. It does not matter whether the storage is high-end enterprise Fibre Channel, where corruption might happen far less often, or cheap SATA. The culprit might not even be the drives or the controller; the machine's memory, the CPU or something else could have corrupted the data. The bottom line is that at some point your digital data will be corrupted. Although the likelihood varies with the operating system, the hardware and the software, it can happen even on IBM mainframes running MVS, where the potential is far lower than on any other system given the number of parity checks and checksums that are calculated and verified.
A Swiss laboratory published a paper last year on data corruption and its sources that is worth reading.
You might wonder what all this has to do with data de-duplication. In a nutshell, if you de-duplicate your data and the hash area for the data de-duplication hardware or software gets corrupted, you can lose all of your data. If you're going to get rid of duplicate data, it's critical that the data you keep be right.
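To make the risk concrete, here is a minimal sketch of a generic de-duplicating store, not any vendor's design. The class name `DedupStore`, the tiny chunk size and the use of SHA-256 are all assumptions made for illustration. It shows how damage to the single stored copy of a shared chunk shows up in every file that references it.

```python
import hashlib

CHUNK_SIZE = 4  # tiny chunk size, chosen only to keep the example readable


class DedupStore:
    """Toy content-addressed store: each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}  # hash -> chunk bytes (the "hash area")
        self.files = {}   # file name -> ordered list of chunk hashes

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)  # store only if new
            recipe.append(digest)
        self.files[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[d] for d in self.files[name])


store = DedupStore()
store.write("a.txt", b"AAAABBBB")
store.write("b.txt", b"AAAACCCC")
store.write("c.txt", b"DDDDAAAA")

# Simulate corruption of the one stored copy of the shared chunk.
shared = hashlib.sha256(b"AAAA").hexdigest()
store.chunks[shared] = b"AAAQ"

# Every file that referenced the shared chunk now reads back damaged.
damaged = sorted(n for n in store.files if b"AAAQ" in store.read(n))
print(damaged)  # prints ['a.txt', 'b.txt', 'c.txt']
```

Without de-duplication, the same fault would have damaged one file. With it, the blast radius is every file that shares the chunk.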
For example, what if the hash data used for comparison was corrupted at the time it was read, but the data on the disk is still good? If you read it again, you will likely get the correct data. But what if the hash data written to disk was bad or went bad? Would you still be able to read your files? Let's step through these two examples and see what happens. As a reminder, I am doing this generically, and the examples might or might not apply to a given vendor's hardware and software.
Case 1: Corrupted Data Read
If you read data from a disk, the data you read was corrupted for any reason (disk drive, channel, controller, or something else), and you then started to apply the corrupted data to new data, you would have a major problem. When you read the information again from disk to de-duplicate it, it would not be the same.
If you compare the data that you read with the incoming data, the copy in memory will be bad, so any incoming data that matches it will have been matched against data that will be different the next time it is read from disk. Basically, any new data that arrives after the corrupted read will be compared incorrectly and will therefore be unreadable.
If the hash is then reread for some reason and is read correctly, any data read after that point will be just fine. Otherwise, it will be a debugging nightmare, one which I am pretty sure is unrecoverable, and a significant amount of data will be lost. The scary part is that some of the data is good and some of the data is bad, and figuring out which is which is likely not possible without some serious detective work.
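One way to picture Case 1 is the sketch below. It is a deliberately simplified model, assuming a de-duplicator that compares incoming data against an in-memory copy of a block; the names `disk`, `cached`, `ingest` and `restore` are hypothetical and do not describe any particular product.

```python
def flip_bit(data, bit):
    """Return a copy of data with one bit inverted (a simulated bad read)."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


disk = {0: b"invoice-0001"}    # block 0 as it really sits on disk
cached = flip_bit(disk[0], 0)  # what a faulty read delivered to memory

refs = {}                      # file name -> block number it points to


def ingest(name, data):
    """De-duplicate against the in-memory copy; True if treated as a duplicate."""
    if data == cached:
        refs[name] = 0         # recorded as a duplicate of block 0
        return True
    next_block = max(disk) + 1
    disk[next_block] = data    # no match, so store a new block
    refs[name] = next_block
    return False


def restore(name):
    return disk[refs[name]]


# New data that happens to equal the corrupted in-memory copy is "matched",
# but the block on disk holds different bytes, so the restore is wrong.
incoming = cached
ingest("new.dat", incoming)
print(restore("new.dat") == incoming)  # prints False

# The genuine duplicate no longer matches and is stored a second time.
print(ingest("dup.dat", b"invoice-0001"))  # prints False
```

In this model the fault is invisible at write time. It surfaces only when someone restores `new.dat`, which is exactly why tracking it down later is so hard.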
Case 2: Corrupted Hash Data
What if the hash data on disk gets corrupted and is bad from the start? This is a similar problem to the first case, except that with Case 1 you have good data, then bad data, and then likely good data. In this case, the hash that was created is in memory and is good, but the copy of the hash on disk is bad. That means you have data that was created with a good hash, but once the hash is read from disk, the data will be bad. The good news, if there is any, is that once the hash is read from disk back into memory, it will stay the same, so the problem should be limited. But the data you created during the period when the original in-memory hash was in use cannot be un-de-duplicated. So when you go to un-de-duplicate the data months or years later, data from that period will be bad, while data created after the hash was re-read from disk will be good. Again, this is a debugging nightmare and likely impossible to figure out.
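One generic defense against this case is to protect the hash records themselves, so a bad on-disk copy is rejected instead of silently used. The sketch below assumes a hypothetical record format, a SHA-256 digest followed by a CRC32, and is not taken from any product.

```python
import hashlib
import zlib


def pack_entry(digest: bytes) -> bytes:
    """Append a CRC32 to a hash record before it is written to disk."""
    return digest + zlib.crc32(digest).to_bytes(4, "big")


def unpack_entry(record: bytes) -> bytes:
    """Return the digest, or raise if the record no longer matches its CRC."""
    digest, stored = record[:-4], int.from_bytes(record[-4:], "big")
    if zlib.crc32(digest) != stored:
        raise ValueError("hash record corrupted on disk")
    return digest


in_memory = hashlib.sha256(b"some chunk").digest()  # the good hash in memory
on_disk = pack_entry(in_memory)                     # what should be on disk

# Simulate a single flipped bit in the on-disk copy of the hash.
bad = bytearray(on_disk)
bad[3] ^= 0x10
bad = bytes(bad)

print(unpack_entry(on_disk) == in_memory)  # prints True
try:
    unpack_entry(bad)
except ValueError as err:
    print(err)  # prints hash record corrupted on disk
```

Detection is not recovery. A check like this tells you the on-disk hash is bad, but you still need a second good copy of it to get your data back.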
What You Need to Ask Vendors
I am a firm believer in the reality of undetected data corruption. It has happened to me and I have seen it happen to others, and sooner or later it will happen to you. I am also a firm believer in the new T10 Data Integrity Field (DIF) standard, which passes 8 bytes of integrity information, including a checksum, from the host to the disk and has the disk confirm the checksum. It should be generally available from a number of vendors, likely later this year. I personally like this standard, as some of it is implemented in hardware in the data path, including the disk drives, and it comes from the same people who brought you the SCSI protocol.
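For readers who want to see what those 8 bytes look like, here is a rough sketch based on my understanding of the standard: a 2-byte guard CRC over the sector, a 2-byte application tag and a 4-byte reference tag that is typically tied to the block address. The function names and the tag values are illustrative assumptions, and real implementations do this in hardware, not in Python.

```python
import struct

T10_POLY = 0x8BB7  # generator polynomial commonly cited for the 16-bit guard CRC


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, initial value 0, no bit reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protection_info(sector: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Build the 8 bytes: guard (2), application tag (2), reference tag (4)."""
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, ref_tag)


def verify(sector: bytes, pi: bytes, expected_ref: int) -> bool:
    """Check the data against its guard and the tag against the expected address."""
    guard, _app, ref = struct.unpack(">HHI", pi)
    return guard == crc16_t10dif(sector) and ref == expected_ref


sector = bytes(512)                         # one 512-byte sector of zeros
pi = protection_info(sector, 0, 1234)       # 1234 stands in for the block address

print(verify(sector, pi, 1234))                  # prints True
print(verify(b"\x01" + sector[1:], pi, 1234))    # prints False (data changed)
print(verify(sector, pi, 1235))                  # prints False (wrong block)
```

The point of the reference tag is that a block written to, or read from, the wrong address fails the check even when its data is internally consistent.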
There are file systems that do checksums, but if a file system is doing checksums and correcting the data, then you have two issues:
- The file system must read the data back to the server before the checksum can be confirmed or rejected. The checksum is not verified by hardware in the data path when the data is written to the device.
- The server CPU must calculate the checksum and also confirm it when the file is read back in. There is a significant effect on the server doing all of this checksum activity. This includes increased memory bandwidth requirements and pressure on the CPU caches: the checksum calculation can displace application data from the caches, so applications may have to reload it from memory, which increases memory bandwidth usage further.
This is an issue if you are running applications that use significant server resources.
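Both issues can be seen in a toy model. The sketch below assumes a hypothetical `ChecksummingFS` class that keeps a host-computed CRC32 per block; it is not modeled on any real file system. A fault in the data path goes unnoticed at write time, is caught only on read-back, and the host CPU pays for a checksum calculation in both directions.

```python
import zlib


class ChecksummingFS:
    """Toy file system that keeps a CRC32 per block, computed on the host."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes as stored on the device
        self.checksums = {}  # block number -> CRC32 computed by the host CPU
        self.crc_calls = 0   # how many times the host had to checksum a block

    def _crc(self, data):
        self.crc_calls += 1
        return zlib.crc32(data)

    def write(self, n, data, corrupt_in_flight=False):
        self.checksums[n] = self._crc(data)
        if corrupt_in_flight:              # simulated fault in the data path
            data = bytes([data[0] ^ 0x01]) + data[1:]
        self.blocks[n] = data              # nothing in the path checks this

    def read(self, n):
        data = self.blocks[n]
        if self._crc(data) != self.checksums[n]:
            raise OSError(f"checksum mismatch on block {n}")
        return data


fs = ChecksummingFS()
fs.write(0, b"payload", corrupt_in_flight=True)  # completes without complaint

try:
    fs.read(0)
except OSError as err:
    print(err)           # prints checksum mismatch on block 0

print(fs.crc_calls)      # prints 2 (one on write, one on read)
```

The write returns success even though the device holds bad data, which is the gap that a check in the hardware data path is meant to close.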
There are products that have their own file systems and checksums and address some of my concerns about data corruption, but not all vendors have products that have this functionality built into their offerings. This is just one of the areas that you should be concerned about with data de-duplication. It should not be the only consideration for the evaluation of a vendor's offering, but it should be one of the high-priority considerations. Vendors might say that this is your problem when you ask the question, and that your environment should be running something like T10 DIF. Wrong answer. Vendors need to be thinking about your hardware and software before you ever ask a question, and if they leave the problem to you, then I would be running the other way.
Data de-duplication is a great tool for some environments, but as with everything complex, it requires some careful planning and execution.
Henry Newman, a regular Enterprise Storage Forum contributor, is an industry consultant with 27 years of experience in high-performance computing and storage.