Both tape-based and disk-based archives are growing at tremendous rates, exceeding the density increases in storage technology and outpacing improvements in storage reliability. Humans are pack rats, the amount of data we save keeps growing, and this is unlikely to change. Part of the reason is that we do not know when data becomes unimportant, or when it might become important again, and there is no standard framework for making that decision (see Data is Becoming Colder, by Jeff Layton).
Since we do not have the tools to know when, or if, we should delete data, we archive everything. This is one of the reasons we have ever more data that is archived and must be protected. Protecting files in a large archive requires generating a checksum for each file, as well as regular validation of every file in the archive to ensure data integrity. When a checksum is invalid, you need software that finds a valid secondary copy of the file and replaces the corrupted file with it.
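A minimal sketch of that detect-and-repair step might look like the following, assuming SHA-256 checksums and a known location for a secondary copy; the function names and arguments here are hypothetical illustrations, not taken from any particular archive product:

```python
import hashlib
import shutil
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so it never has to fit in memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def repair_if_corrupted(primary: Path, replica: Path, expected: str) -> bool:
    """Replace the primary copy from the replica if its checksum no longer matches."""
    if sha256_of(primary) == expected:
        return False                      # primary copy is still intact
    if sha256_of(replica) != expected:
        raise RuntimeError(f"both copies of {primary.name} are corrupted")
    shutil.copy2(replica, primary)        # restore from the valid secondary copy
    return True
```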
I was recently talking with a customer who has large preservation archives, and I stated that checksum verification can turn your archive problem into a high performance computing (HPC) problem. My definition of a preservation archive is an archive whose stated goal is that the information in it must remain bit-for-bit identical forever, unless a file is rewritten for something like a format change (e.g., PDF 1.3 to 1.5).
The customer paused and asked why, and as I started to explain the answer, he suggested that I write an article on the subject. It dawned on me that large preservation archives require significant amounts of computational power, memory bandwidth, PCIe bus bandwidth, and storage bandwidth; architecturally, that is not much different from HPC workloads, which are very computation- and I/O-intensive.
Today, many preservation archives are well over 5PB, and a few are well over 10PB, with expectations that these archives will grow to more than 100PB. With archives this large, the architectural requirements for checksum validation are not much different from those of standard HPC simulation problems, such as weather, crash, and other simulations.
Computational Power
Most HPC problems require large numbers of floating point operations, but some problems, such as genetic pattern matching, also require significant integer performance. In large archives, checksums should be validated regularly; how regularly depends on the quality of the hardware and the amount of data, but even good hardware can go bad and corrupt your data.
Some archive systems use commodity hardware with well-known reliability issues, including, but not limited to, memory without parity protection, low-end network adapters, and consumer-level disk drives; these components have much higher rates of silent data corruption than, for example, ECC memory, high-end RAID controllers with SAS disk drives, or enterprise-level tape systems. Checksums must be validated regularly, and checksum algorithms must be robust, which requires significant computational resources.
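To get a feel for what "significant computational resources" means in practice, a quick single-core timing sketch like the one below can be run on candidate hardware; the buffer size, round count, and algorithm choices are arbitrary assumptions, and real archive software may use different or hardware-accelerated checksums:

```python
import hashlib
import time
import zlib

def gbytes_per_sec(hash_fn, data, rounds=10):
    """Hash the same buffer repeatedly and report single-core throughput in GB/sec."""
    start = time.perf_counter()
    for _ in range(rounds):
        hash_fn(data)
    elapsed = time.perf_counter() - start
    return len(data) * rounds / elapsed / 1e9

buf = bytes(256 * 1024 * 1024)   # 256MB buffer of zeros, purely for timing
print(f"crc32  : {gbytes_per_sec(lambda d: zlib.crc32(d), buf):.2f} GB/sec")
print(f"md5    : {gbytes_per_sec(lambda d: hashlib.md5(d).digest(), buf):.2f} GB/sec")
print(f"sha256 : {gbytes_per_sec(lambda d: hashlib.sha256(d).digest(), buf):.2f} GB/sec")
```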
To validate the checksum for a file, the whole file must be read from disk or tape into memory, the checksum algorithm applied to the data, and the newly calculated checksum compared to the stored checksum. The stored checksum should itself be checksummed as well, so you can be sure you are comparing the file against a valid value. With large archive systems this is often an ongoing process whether the data resides on disk or tape, but checksum validation is particularly critical for disk-based archives built on consumer-grade storage.
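The point about checksumming the stored checksums can be sketched as follows, where a manifest of per-file checksums carries its own checksum that is verified before any file comparisons are trusted; the JSON manifest layout is an assumption for illustration, not a standard format:

```python
import hashlib
import json
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def load_manifest(manifest: Path, manifest_sum: Path) -> dict:
    """Refuse to trust the stored checksums unless the manifest itself verifies."""
    if file_sha256(manifest) != manifest_sum.read_text().strip():
        raise RuntimeError("the checksum manifest is itself corrupted")
    return json.loads(manifest.read_text())   # {"relative/path": "sha256 hex digest", ...}

def find_corrupted(root: Path, manifest: Path, manifest_sum: Path):
    """Yield paths whose current checksum no longer matches the stored value."""
    expected = load_manifest(manifest, manifest_sum)
    for rel_path, stored in expected.items():
        if file_sha256(root / rel_path) != stored:
            yield rel_path
```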
Memory Bandwidth
HPC problems almost always involve CPU cores that are waiting on memory requests. In fact, some people have jokingly said that this is the definition of an HPC problem. Similarly, checksum calculations require significant memory bandwidth and will leave cores idle. Since the whole file must be read into the core and have the checksum algorithm applied only once, there is no data reuse in any of the caches; the file simply streams through the caches until it reaches the core to be processed.
You might think that most of the memory bandwidth would be consumed reading data into memory, since all of the files reside on disk or tape and must be brought in. That transfer, however, is actually a write from the PCIe bus into memory, followed by a read from memory into the core to calculate the checksum. So for checksum calculations, memory traffic is split nearly 50/50 between reads and writes, as files are written into memory from the PCIe bus and then read from memory into the cores for processing. Of course, at the end of the process the newly calculated checksum must be compared to the originally generated checksum.
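A quick back-of-the-envelope calculation makes the 50/50 split concrete; the archive size below is only an illustrative assumption:

```python
# Illustrative numbers only: every byte checksummed is written to memory once
# (DMA from the PCIe bus) and read from memory once (into the core running the hash).
archive_bytes  = 10 * 10**15      # assume a 10PB archive
writes_to_mem  = archive_bytes    # PCIe/DMA -> memory
reads_from_mem = archive_bytes    # memory -> CPU core
total_traffic  = writes_to_mem + reads_from_mem

print(f"memory traffic per full validation pass: {total_traffic / 10**15:.0f} PB")
print(f"read/write split: {reads_from_mem / total_traffic:.0%} / {writes_to_mem / total_traffic:.0%}")
```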
PCIe Bandwidth
The PCIe bus is likely the most critical element of the system architecture, given that historically many PCIe buses do not run at their rated performance. With most CPU architectures today, memory bandwidth is at least 2X, and sometimes 8X or more, greater than the bandwidth of the PCIe buses, and memory bandwidth is in turn slower than what the CPUs can consume. This means that PCIe bandwidth is critical for checksum calculations. Buying machines with poor PCIe bus bandwidth will limit checksum verification speed, because you need to get the data into memory.
With the PCIe 3.0 standard recently ratified, you can expect to see PCIe 3.0 systems later this year. Doubling PCIe performance will help, given the significant increase in memory bandwidth in the latest generation of technology from vendors such as AMD, IBM, and Intel. The problem is that PCIe 3.0 only doubles performance over PCIe 2.0, while memory bandwidth has gone up at a far greater rate. This imbalance affects how much data you can read from storage and how quickly you can validate checksums.
Storage Bandwidth
Storage bandwidth is the long pole in the checksum validation tent, given that storage performance has not kept pace with either PCIe bandwidth or memory bandwidth. Though flash technology has much higher bandwidth than rotating storage, it is not cost-effective for large archives.
Storage resources must be able to read the data at a reasonable rate. Say you have a 10PB archive and want to validate checksums every 30 days. That would require roughly 4GB/sec of sustained bandwidth (10PB / (30 × 24 × 3600 seconds)), and that figure does not include ingest and file recalls from users. This means the storage systems must be able to read at roughly 4GB/sec from disk or tape into memory just for validation. Clearly, validation every 30 days is not practical given the high cost, but the validation requirements, including how often you want to validate your archive, must be designed into the architecture and should be a major architectural consideration.
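For reference, the arithmetic behind that figure is simple enough to sketch; the capacity and validation window are the assumptions from the example above:

```python
# Rough sizing for a full validation sweep; the 10PB capacity and 30-day window come
# from the example, and whether "PB" means 10**15 or 2**50 bytes shifts the answer slightly.
window_seconds = 30 * 24 * 3600                    # validate everything every 30 days

for label, petabyte in (("decimal PB", 10**15), ("binary PiB", 2**50)):
    required_bw = 10 * petabyte / window_seconds
    print(f"10 {label}: {required_bw / 1e9:.2f} GB/sec of sustained reads")
# prints roughly 3.86 and 4.34 GB/sec, before adding ingest and user recalls on top
```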
Final Thoughts
The reliability of digital data can be affected by many factors, from bit rot to bad hardware to the statistical probability of silent data corruption based on standard channel error rates. Checksum validation is critical to keeping archive data valid, as is having multiple copies. Using robust checksums improves the validation process but increases the computational requirements.
The key is to have a balanced system that meets the requirements for checksum validation, ingest and access. Balancing CPU, memory, PCIe and storage bandwidth is often a difficult part of the architectural planning process.
The only real difference between large archives and large HPC problems is the network interconnect between the nodes, which in the case of HPC is usually InfiniBand, given the need for high performance and low latency.
Large preservation archives could benefit from some of the architectural techniques developed in designing HPC systems. With large archives, you cannot expect the data to come back unchanged years later without regular checks, which reduce the probability that more than one copy of the data will be corrupted at the same time.
Henry Newman, CEO and CTO of Instrumental, Inc., and a regular Enterprise Storage Forum contributor, is an industry consultant with 29 years experience in high-performance computing and storage.