What happened to the really cheap archive storage we were expecting? Can't we get archival storage for $0.25/GB?
The answer is that you can get archive storage close to that price, but you run risks in doing so for very large archives. To drastically reduce the risks, you need to make two or three copies. The $0.25/GB archive suddenly becomes a $0.50/GB or $0.75/GB archive. But these are just the hardware costs.
You need a file system of some type to store the data. At the low end of the spectrum, you could just create 2-3 different pools of storage using freely available file systems. Then you could put the archive data on one copy and use rsync to make sure that the data is copied to the other pool(s).
But there is a problem with this approach. What happens if you encounter a hard read error (a bad sector)? This triggers a RAID rebuild and its associated issues. You can use the other copies to restore the data, but that has to be programmed. And you have to be sure the remaining copies are correct (i.e., no bit-rot). In other words, you will have to do all of the programming and maintenance yourself. But all of this is free, right?
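As a rough illustration of the kind of checking you would have to program yourself, here is a minimal Python sketch that compares one file across several storage pools by checksum and flags the copies that disagree with the majority. The function names, the pool layout, and the choice of SHA-256 are assumptions made for this example; they are not part of any particular file system or tool.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_suspect_copies(relative_path, pool_roots):
    """Compare one file across pools (hypothetical helper).

    Returns (majority_digest, suspects). majority_digest is None when no
    digest is shared by a strict majority of the pools, in which case
    every pool is reported as suspect.
    """
    digests = {}
    for root in pool_roots:
        try:
            digests[str(root)] = sha256_of(Path(root) / relative_path)
        except OSError:
            # A missing or unreadable copy counts as a disagreement.
            digests[str(root)] = None

    counts = Counter(d for d in digests.values() if d is not None)
    if not counts:
        return None, sorted(digests)

    winner, votes = counts.most_common(1)[0]
    if votes * 2 <= len(digests):
        # No strict majority, so we cannot tell which copy is correct.
        return None, sorted(digests)

    suspects = sorted(root for root, d in digests.items() if d != winner)
    return winner, suspects
```

With only two copies, a mismatch tells you that something is wrong but not which copy to trust. With three copies, a majority vote can identify the damaged one. Repairing the damaged copy, scheduling the scans, and storing known-good checksums would all be additional work on top of this sketch.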
At the other end of the spectrum there are file systems, many of which are proprietary or have commercial support, that handle all of this work for you. In these file systems, data is copied to ensure that there are two to three copies distributed across the storage pools. In the event of a hard read error, the file system can read one of the other copies of the data while doing the rebuild in the background. Once the rebuild is done, it will then check the rebuilt data against the other copies. But again, the system is reading data, so we're increasing our chances of another hard read error.
All of this takes a great deal of work that happens in the background while you aren't watching. Having one of these file systems can greatly reduce your workload.
The Final Word
Archives are not as simple as they appear. You have to ask yourself about the purpose of the archive and the projected amount of data that will need to be stored. Most importantly, though, you need to ask yourself about the importance of the data in the archive. Answering this simple question can have a very large impact on the economic realities of the archive.
I apologize for bursting any bubbles, but you cannot have large archives on spinning disk with only one copy of the data and expect that data to always be there. The hard error rates in the previous tables illustrate this quite clearly.
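To get a feel for the scale involved, the following Python sketch estimates how many hard read errors to expect when reading an entire archive once. The rate used in the usage note (one error per 10^15 bits read) is an assumption for illustration, a figure of the kind commonly quoted on drive spec sheets; it is not taken from the tables referenced above. The function names are hypothetical, and the probability uses a Poisson approximation that treats errors as independent.

```python
import math


def expected_hard_errors(bytes_read, bits_per_error):
    """Expected number of hard read errors for a given amount of data read.

    bits_per_error is the assumed spec-sheet rate: one unrecoverable
    error per this many bits read.
    """
    return bytes_read * 8 / bits_per_error


def probability_of_any_error(bytes_read, bits_per_error):
    """Chance of at least one hard error (Poisson approximation)."""
    return -math.expm1(-expected_hard_errors(bytes_read, bits_per_error))
```

Under that assumed rate, `expected_hard_errors(10**15, 10**15)` gives 8.0 for one full read of a 1 PB archive, and `probability_of_any_error` for the same inputs is above 99.9%. Substitute the error rate for your own drives to see how the picture changes.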
If you really don't care if some data becomes unreadable, then you can purchase enough hardware for one copy of the data, perhaps getting the low price you expected.
But if your data is very important and you are worried about unrecoverable reads, possibly more than one, you will need multiple copies of the data. This also means you will need more hardware than you thought. For example, if you want a 1 PB archive you will need 2 PB, 3 PB, or more of raw capacity. If storage hardware costs $0.25/GB, then storing three copies of the data will cost $0.75 for every GB archived.
This simple example of the impact of hard read error rates, and the number of data copies they imply, is an economic reality that many people don't want to face. The expectation is that archive data has modest performance requirements, so it should be inexpensive. The reality is that if you want large archives on spinning disk, and you want to be able to read them in a timely manner and reduce the risk of losing data, then you will likely need one or more additional copies of the data. This costs more than you expect, but that's the reality of archive storage.
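The arithmetic behind this example can be sketched in a few lines of Python. The function name and the use of decimal units (1 PB = 1,000,000 GB) are assumptions for illustration, and the figures cover hardware only, not file system software, power, or administration.

```python
GB_PER_PB = 1_000_000  # decimal units, assumed for this example


def archive_hardware_cost(archive_pb, copies, price_per_raw_gb):
    """Return (raw PB needed, total hardware cost, cost per archived GB)."""
    raw_pb = archive_pb * copies
    total_cost = raw_pb * GB_PER_PB * price_per_raw_gb
    cost_per_archived_gb = total_cost / (archive_pb * GB_PER_PB)
    return raw_pb, total_cost, cost_per_archived_gb


# Compare one, two, and three copies of a 1 PB archive at $0.25 per raw GB.
for copies in (1, 2, 3):
    raw_pb, total, effective = archive_hardware_cost(1, copies, 0.25)
    print(f"{copies} copies: {raw_pb} PB raw, ${total:,.0f} total, "
          f"${effective:.2f} per archived GB")
```

For a 1 PB archive at $0.25 per raw GB, three copies come to 3 PB of raw capacity and $750,000 of hardware, or $0.75 per GB actually archived.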