Disaster recovery testing is a multi-step drill of an organization's disaster recovery plan (DRP) designed to assure that information technology (IT) systems will be restored if an actual disaster occurs.
As part of a DR plan, companies typically hire a disaster recovery service.
Why is Disaster Recovery Testing Essential?
During a disaster, a natural or man-made event interrupts normal IT functions such as data processing, communications, virtualization, and network and data center operations.
The loss of IT functions in a disaster often leads to business failure. For instance, 93 percent of companies that lose their computer systems for 10 days or more because of a disaster file for bankruptcy within 12 months of the event, according to the U.S. National Archives & Records Administration.
- Natural disasters, including hurricanes such as Katrina and Sandy, earthquakes, floods, and tsunamis, are all potentially business-ending.
- Man-made disasters that can knock a business offline include acts of terrorism, computer vandalism, sabotage, and inadvertent mishaps such as hardware misconfigurations and accidentally deleted files.
Disasters don't occur very often, but when they do, the effects can be devastating.
The main objective of disaster recovery testing (DRT) is to make sure that, if a disaster does happen, the DR plan will actually work: the company's DR site will go live, and IT systems will come back online with minimal downtime. Even if a company uses cloud-based DR or disaster recovery as a service (DRaaS), DR testing reveals whether the backup is truly as dependable as it needs to be.
Ongoing testing is a necessity, since the effectiveness of the DRP can be impacted by the inevitable changes to personnel, skill levels, and hardware and software architectures within an organization.
Fully testing the plan is therefore a critical part of having a DR plan at all.
Disaster Recovery Scenarios
DR testing plans can help organizations prepare for just about any type of IT disaster, including the following kinds of scenarios, which have unfolded in real life.
- In an insider sabotage attack, a company disabled access for a software engineer just before firing him. However, the disgruntled employee had logged into the system from home earlier in the week and left his remote connection open. After the firing, he used this connection to delete several critical files from a manufacturing app. The company lost four hours of manufacturing time before being able to reload backup data and start up manufacturing again, says a study published by Carnegie Mellon University (CMU).
- In 2017, enterprises that included FedEx, Maersk, Merck, and many others fell victim to a ransomware-inspired virus called NotPetya. After its global shipping business ground to a halt, Maersk later admitted to taking a $670 million hit from technology cleanup, business disruptions, and lost sales. For its part, FedEx lost $400 million.
- In contrast, with advance warning of Hurricane Katrina back in 2005, the City of New Orleans managed to keep important business functions running without interruption during and after the deadly storm. The city downloaded critical systems such as financial management and shipped them in advance to an ACS data center in California. The city's web sites were moved from City Hall to a data center in Dallas operated by Red Carpet Host. Following Katrina, the city set up a backup data center in Austin.
Disaster Recovery vs. Business Continuity Planning
Disaster recovery planning and testing is often confused with business continuity planning (BCP). Although DRP and BCP are closely related, they are not the same.
A DR plan and testing system specifies the steps an IT organization must take to recover systems that will meet the company's technology needs after a disaster.
A BCP, on the other hand, spells out what a business must do to make sure that its products and services remain available to customers. A BCP is made up of a business impact analysis, risk assessment, and an overall business continuity strategy. It is tested through a business continuity test (BCT).
Some organizations treat DRP/DRT and BCP/BCT separately, while others include DR within overall business continuity planning and testing.
Five DR Testing Techniques
Beyond restoring data and keeping critical applications and services online during the emergency, DR solutions should include ways to alert staff about the disaster and to allow communications during and after the event if regular phone lines and networks go down.
In the planning and testing process, DR teams should also recognize that, despite the disaster, the organization must continue to meet its security and regulatory compliance obligations.
Five types of DRTs are used to test disaster recovery solutions:
- Paper test: In a paper test, members of the DR team read and annotate recovery plan documents such as DR policies, procedures, timelines, benchmarks, and checklists. A hard copy of documents should be stored in a secure offline environment, and a digital copy in the cloud.
- Walk-through test: A walk-through test is a group review of the DRP, step by step, to pinpoint any issues that need to be addressed and any modifications that should be made to the disaster recovery environment.
- Simulation: In a procedure somewhat along the lines of a fire drill, teams practice the DRP in real life to make sure that it's sufficient for IT disaster recovery.
- Parallel test: In a parallel test, failover recovery systems are tested to make sure that, in case of disaster, they can perform real business transactions supporting key processes and applications. Meanwhile, primary systems continue to run the full production workload.
- Cutover test: A cutover test goes further to test failover recovery systems built to take over the full production workload in case of disaster. Primary systems are disconnected during the test.
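The idea behind a parallel test can be sketched in a few lines of Python: replay the same transactions against the primary and the failover system and record where the failover disagrees or fails. This is only an illustration. The function name `compare_systems` and the callables standing in for the two systems are hypothetical, not part of any DR product.

```python
from typing import Any, Callable, Iterable


def compare_systems(
    primary: Callable[[Any], Any],
    failover: Callable[[Any], Any],
    transactions: Iterable[Any],
) -> list[tuple[Any, Any, Any]]:
    """Replay each transaction on both systems and collect disagreements.

    Returns (transaction, primary_result, failover_result) tuples for every
    transaction where the failover system raised an error or gave a
    different answer than the primary.
    """
    mismatches = []
    for transaction in transactions:
        expected = primary(transaction)
        try:
            actual = failover(transaction)
        except Exception as error:  # the failover system could not process it
            mismatches.append((transaction, expected, f"error: {error}"))
            continue
        if actual != expected:
            mismatches.append((transaction, expected, actual))
    return mismatches
```

An empty result means the failover system handled every replayed transaction the same way the primary did. In a real parallel test the primary would keep serving production traffic while copies of those transactions are sent to the failover side.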
Six Disaster Recovery Testing Levels
In parallel and cutover testing, IT systems can be tested at differing levels of comprehensiveness. IT organizations vary as to levels of testing performed, as do DR service providers.
The most basic level of testing checks that blocks and files are intact after they have been backed up, but it does not ensure that applications can be functionally recovered.
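A file-level version of this check can be approximated by comparing checksums of the source files with their backed-up copies. The sketch below assumes the backup is a plain directory tree that mirrors the source. Real backup products usually store data in their own formats, so the helper names `sha256_of` and `verify_backup` are illustrative only.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir: Path, backup_dir: Path) -> list[str]:
    """Return a list of files that are missing from or differ in the backup."""
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(source_file) != sha256_of(backup_file):
            problems.append(f"mismatch: {relative.as_posix()}")
    return problems
```

An empty list only shows that the bytes match. As noted above, it says nothing about whether an application can actually be recovered from them.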
Database Mounting
Database mounting verifies that a database restored from backup has basic functionality.
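As a rough illustration of a mount check, the sketch below opens a restored database file read-only and asks the engine to validate it. SQLite is used purely as a stand-in because it ships with Python. The function name `database_mounts` is hypothetical, and a production database would need its own vendor-specific mount and consistency checks.

```python
import sqlite3


def database_mounts(backup_path: str) -> bool:
    """Open a restored SQLite file read-only and run its integrity check."""
    try:
        # mode=ro ensures the check never modifies the restored copy.
        connection = sqlite3.connect(f"file:{backup_path}?mode=ro", uri=True)
    except sqlite3.Error:
        return False
    try:
        row = connection.execute("PRAGMA integrity_check").fetchone()
        return row is not None and row[0] == "ok"
    except sqlite3.Error:
        # Raised, for example, when the file is not a database at all.
        return False
    finally:
        connection.close()
```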
Single Machine Boot Verification
Single machine boot verification verifies that a single server can be rebooted after it's gone down.
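One simple way to automate a boot check is to wait until the recovered server accepts TCP connections on a known port, such as SSH on port 22. The sketch below assumes that an open port is an acceptable signal that the machine has booted. The function name `wait_for_port` and its default timings are illustrative choices.

```python
import socket
import time


def wait_for_port(
    host: str, port: int, timeout_s: float = 300.0, interval_s: float = 5.0
) -> bool:
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connection is treated as "the machine has booted".
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            pass  # not reachable yet; retry until the deadline
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

A reachable port shows only that the operating system came up. It does not show that the applications on the server are working.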
Single Machine Boot with Screenshot Verification
This test sends an image of the operating system to administrators as proof that a server can be rebooted. However, it does not prove that the server will still be functional to the business.
DR Runbook Testing
DR runbook testing covers multiple machines that deliver a business service together, such as a clustered database or an enterprise resource planning (ERP) system.
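When several machines deliver one service, the order in which they are recovered usually matters. A runbook can record which machine depends on which, and a topological sort then yields a valid start-up order. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The function name and the machine names in the usage example are hypothetical.

```python
from graphlib import TopologicalSorter


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return a start-up order in which each machine follows its dependencies.

    depends_on maps a machine name to the set of machines that must be
    running before it starts. graphlib.CycleError is raised if the
    dependencies are circular.
    """
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical three-tier service used only for illustration.
runbook = {
    "db-primary": set(),
    "db-replica": {"db-primary"},
    "app-server": {"db-primary"},
    "load-balancer": {"app-server"},
}
order = recovery_order(runbook)
```

Several orders can be valid for the same runbook, so a test of the runbook should check the dependency constraints rather than one exact sequence.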
Recovery Assurance
The highest level of testing, recovery assurance encompasses multiple machines, deep application testing, service level agreement (SLA) assessment, and analytics that explain why any recovery attempt failed. Some, but not all, DRaaS providers offer recovery assurance testing.
Disaster Recovery Testing Best Practices
Test Regularly and Thoroughly
Some large organizations do DR testing on a quarterly basis. Yet despite publicity around disaster recovery lessons learned, 23 percent of businesses never test disaster recovery, whereas about 33 percent test once or twice a year. Further, out of the companies that do test their DRPs, about 65 percent fail their own DRTs, according to one survey.
While the frequency of testing will depend on your business and its DR readiness, experts strongly advise doing a full test at least once per year.
Set Measurable Benchmarks
For critical applications, set recovery time objectives (RTOs) and recovery point objectives (RPOs), which are measurable on a scale. These benchmarks show whether you are reaching your objectives and document the processes that account for success.
Some industries, including health care, require organizations to know and document their RTOs. Regardless of which industry you're in, by using benchmarks that are measured on a scale, rather than just pass/fail, you're better equipped to identify DR procedures which need improvement.
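To show what measuring on a scale can look like, the sketch below derives the achieved recovery time and recovery point from three timestamps recorded during a drill and expresses each as a ratio of its objective. The class name `DrillResult`, the function `score`, and the field names are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    """Timestamps recorded during one DR test of a single application."""

    outage_start: datetime      # when the simulated disaster began
    last_good_backup: datetime  # newest backup that could be restored
    service_restored: datetime  # when the application was usable again

    @property
    def achieved_rto(self) -> timedelta:
        """Downtime: how long the application was unavailable."""
        return self.service_restored - self.outage_start

    @property
    def achieved_rpo(self) -> timedelta:
        """Data loss window: time between the last backup and the outage."""
        return self.outage_start - self.last_good_backup


def score(achieved: timedelta, objective: timedelta) -> float:
    """Ratio of achieved to objective; 1.0 or less meets the objective."""
    return achieved / objective
```

For example, a drill with 2 hours of downtime against a 4-hour RTO scores 0.5, while a 30-minute data loss window against a 15-minute RPO scores 2.0. The second number shows not just that the objective was missed but by how much.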
Keep DR Team Members on Their Toes
Clearly define all individuals responsible for researching, developing, implementing, and testing the DRP. Assign a backup person for each role in a DR exercise in case the designated individual is out of the office. Share the DRP and DRT with all team members.
If team members leave the company, make sure that their replacements are trained on DRP and DRT policies and procedures. Then arrange for a group run-through of the DRT to smooth out disaster recovery processes.
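A roster of DR roles is easy to check automatically for gaps. The sketch below flags any role that has no backup person or whose backup is the same individual as the primary. The roster layout, the function name, and the role and person names in the example are hypothetical.

```python
from typing import Optional


def roles_without_backup(
    roster: dict[str, tuple[str, Optional[str]]]
) -> list[str]:
    """Return the roles that lack a distinct backup person.

    roster maps each role to a (primary, backup) pair; backup may be None.
    """
    gaps = []
    for role, (primary, backup) in roster.items():
        if not backup or backup == primary:
            gaps.append(role)
    return sorted(gaps)


# Illustrative roster; the roles and names are made up.
roster = {
    "DR coordinator": ("Avery", "Blake"),
    "Database restore": ("Casey", None),
    "Network failover": ("Devon", "Devon"),
}
gaps = roles_without_backup(roster)
```

Running a check like this whenever the roster changes helps catch roles left uncovered after someone leaves the company.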
Work with a DR Partner If You Need One
While big organizations often have the internal expertise on hand to perform DRT themselves, many smaller companies turn to DR companies for assistance.
Beyond the multifaceted DRaaS, disaster recovery service providers offer specialized services such as ongoing testing and 24/7 performance monitoring of customers' DR solutions.