First, some background on enterprise SSD DWPD.
Drive writes per day (DWPD) specifies how many times you can completely rewrite all the data on an SSD within a 24-hour period. All of the enterprise and most of the consumer SSD vendors I could find online publish either a DWPD rating for their SSDs or the drive's total endurance in GB written.
Either way, you can calculate how much data you can write and still stay within the vendor's warranty. But how do you determine, in advance of buying the SSDs, how much data you are going to write per day?
If you underestimate your daily writes, all of your storage can potentially fail at nearly the same time, which we all know will cause data loss. Let's break down the components to see how much data you write per day. The way I see it, there are three inputs:
1. User applications
2. System applications
3. Storage device and system overhead
If your storage system is running on, say, a RAID storage target, you can use iostat or sar to monitor the amount of data written to the target. Most warranties for SSDs, and for that matter disk drives, are five years. That seems pretty simple, but things change over time. Take the following example: you have a 3 TB SSD that supports 1 DWPD, and a month of iostat and sar data shows your applications writing only 1 TB per day. (You can buy SSDs rated for far more than 1 DWPD; I am just using this number to keep the math simple.)
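As a rough sketch of how such monitoring data turns into a TB-per-day number, the snippet below converts two `/proc/diskstats`-style "sectors written" samples (the same counters iostat reads; these counters are always in 512-byte units) into TB written per day. The helper name and the sample values are mine, for illustration only:

```python
# Convert two "sectors written" samples into TB written per day.
# In /proc/diskstats these counters are always in 512-byte units,
# regardless of the device's native sector size.
SECTOR_BYTES = 512

def tb_written_per_day(sectors_start, sectors_end, hours=24.0):
    bytes_written = (sectors_end - sectors_start) * SECTOR_BYTES
    return bytes_written / 1e12 * (24.0 / hours)

# Hypothetical samples taken 24 hours apart:
print(tb_written_per_day(0, 1_953_125_000))  # 1.0 TB/day
```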
So for a simple case where nothing changes, you would have this:
[Table: TB written per day and total TB of writes remaining on the drive, flat workload]
So after 60 months you have plenty of full device writes left on the drive, because your workload never changed.
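For the flat case, the arithmetic is simple; the sketch below uses the example's numbers (3 TB drive, 1 DWPD, five-year warranty, 1 TB/day):

```python
# Endurance budget vs. actual writes for the flat-workload example.
capacity_tb = 3.0
dwpd = 1.0
warranty_days = 5 * 365

budget_tb = capacity_tb * dwpd * warranty_days  # rated writes over the warranty
written_tb = 1.0 * warranty_days                # 1 TB/day, never changing

print(budget_tb, written_tb, budget_tb - written_tb)  # 5475.0 1825.0 3650.0
```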
Now on the other hand, what if your write workload increased four percent a month every month for the next five years?
Even though your architecture was designed to sit significantly below the required write threshold, with only 1 TB a day being written to a device designed to support 3 TB a day, you would be surprised what compounding that four percent does over time. As the table below shows, it has a dramatic impact on the total number of TBs still available to be written: over a five-year period you will write more than the SSD can support.
[Table: TB written per day and total TB of writes remaining on the drive, four percent monthly growth]
After 54 months, you have written more than the drive can support. If your monthly increase goes from four percent to, say, six percent, the SSD will likely fail, and you will be out of warranty, at month 43, which is only about three and a half years. Remember, your starting point was only a third of the drive's full device writes per day.
[Table: TB written per day and total TB of writes remaining on the drive, six percent monthly growth]
The system applications, such as the operating system and its logs, might have an impact as well. You might monitor your system and not see much going on, but then you load a new operating system, face an increase in logging requirements, or perform some other operation that increases the amount of data being written.
For system applications, data very likely needs to be monitored over a significantly longer timeframe, because the write load from system applications and logs often depends heavily on the activity in the system. In an SELinux environment, where everything is logged, it becomes very important to understand the number of users and their activities.
Storage device and system overhead
Looking at the iostat and sar data can give you an idea of how much data is moving from the servers to the storage target, but in a RAID environment that is not the whole story for some RAID levels. With RAID-1 (mirrors), you are not going to have read-modify-write issues, which are common in RAID-5/6 implementations when writes are not aligned with the internal RAID stripe.
So the amount of data written to the storage target is likely the amount actually written if everything is aligned, but you might be writing more, depending on the RAID system and allocation, if it is not. In addition, if there is an SSD failure you are going to have to rebuild, which most likely takes one full drive write away from the new drive.
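To fold those effects into the budget, you can apply a write amplification factor to the measured host writes and charge rebuilds against the endurance budget. The factors below are illustrative assumptions on my part (roughly 1x for aligned full-stripe writes, roughly 2x device writes per host write for unaligned RAID-5 small writes, since the parity read-modify-write doubles them), not vendor figures:

```python
# Illustrative device-level write accounting; the factors here are
# assumptions for the sketch, not vendor specifications.
def device_tb_per_day(host_tb_per_day, raid_write_factor=1.0):
    # raid_write_factor: ~1.0 for aligned full-stripe writes,
    # ~2.0 for unaligned RAID-5 small writes (read-modify-write).
    return host_tb_per_day * raid_write_factor

def budget_after_rebuilds(budget_tb, capacity_tb, rebuilds=1):
    # Each rebuild costs roughly one full drive write on the new drive.
    return budget_tb - rebuilds * capacity_tb

print(device_tb_per_day(1.0, 2.0))         # 2.0 TB/day hitting the devices
print(budget_after_rebuilds(5475.0, 3.0))  # 5472.0 TB of budget remaining
```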
Developing an architectural plan for SSDs that will last five years, which is generally the warranty period for enterprise SSDs, is significantly more complicated given that endurance is specified in full device writes. It should be noted that, after looking around on the web for a bit, most disk drive manufacturers are also specifying hard drives in TB per year, which can easily be translated into full device writes.
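That translation is a one-liner. The 550 TB/year rating for a 3 TB drive below is a hypothetical number of my choosing, used only to make the arithmetic visible:

```python
# Convert a TB-per-year endurance rating into DWPD.
def tb_per_year_to_dwpd(tb_per_year, capacity_tb):
    return tb_per_year / 365 / capacity_tb

print(round(tb_per_year_to_dwpd(550, 3.0), 2))  # 0.5 DWPD
```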
So the world is changing, and for the most part the information available and the capacity management tools are not up to the task. This is not a surprise, given that we have been solving problems by throwing hardware at them since the mid-1990s rather than spending the time and effort to develop architectural plans based on the manufacturers' specifications.
People have been saying for almost 20 years that it is cheaper to buy hardware than to monitor systems and plan for the future. That might be the case, but avoiding data loss might depend on monitoring your systems, because in a RAID world the devices will likely fail at nearly the same time. I have always been a proponent of monitoring, capacity planning and the like, and it might be time to revisit these practices, because your capacity planning tools are going to be needed in an ever more complex architectural world.