A disaster recovery plan is an insurance policy, of sorts. Your business needs a DR plan because a well-implemented disaster recovery plan will make your IT infrastructure whole when disaster strikes.
More than an offsite data center and a collection of tools for data recovery and getting your systems back up and running, disaster recovery—often shortened to DR—also encompasses the policies and procedures that your organization's IT workers should follow to successfully get your business back on track.
As any seasoned IT pro will tell you, disasters can take many forms. And they don't necessarily have to rise to the level of a data center-rattling earthquake or the storm of the century.
Sure, nature is responsible for its share of hurricanes, blizzards, floods, wildfires and countless other ways to interrupt a company's IT operations. But in terms of disaster recovery, people and all their foibles can fall into the same category.
Human error, improper configurations and cyber-attacks can all cause servers and other IT equipment to fail. Sometimes a disaster can be traced back to a faulty server rack, a buggy application or some other mishap.
When it comes time to craft a disaster recovery plan, IT personnel must document its scope and objectives. While they may vary depending on the severity of a disaster and between organizations, even those operating in the same industry, the documentation should be clear on what it covers—from a modest fleet of desktop systems to massive data storage archives—and how the steps described therein help meet an organization's data recovery objectives and other goals.
Effective disaster planning includes tying together many elements of the storage ecosystem.
Although disaster recovery and business continuity are sometimes used interchangeably—and they are indeed related—they serve very different purposes.
Disaster recovery is a subset of business continuity. Whereas disaster recovery is generally focused on a company's IT operations, business continuity involves the entire business or at least those functions that are critical to its ongoing operations.
A business continuity plan includes policies, procedures and contingencies that can be used to continue conducting business in the event of a disaster or other disruption. It takes more than a company's data and IT systems into account, reaching other areas that are typically outside an IT department's purview, like office space, suppliers, employees and industrial equipment.
Considering the integral role of IT in modern enterprises, disaster recovery can be considered a vital component of a business continuity plan.
Your disaster recovery plan will be heavily influenced by the IT systems and services that your business relies on. Although there is no one-size-fits-all approach, here are some factors to consider.
- Virtualization Disaster Recovery
One of the benefits of virtualization is that it can eliminate the need to recreate a physical server when something goes wrong. Placing a virtual server on reserve capacity or in the cloud is a very real possibility, which can make achieving your organization's recovery time objectives (RTOs) trivially easy in some circumstances.
Take stock of the virtualization platforms (VMware, Microsoft Hyper-V, Oracle VM, Citrix XenServer, etc.) used in your environment, along with the backup and recovery tools used by each, and draw up a plan to get virtual workloads up and running again.
- Network Disaster Recovery
Servers aren't the only part of an organization's IT infrastructure that may be affected by a disaster. Networks can also meet an untimely demise, which in turn can lead to failures in business applications and services that depend on reliable network connectivity.
A network disaster recovery plan often includes procedures for contacting the proper IT personnel, acquiring replacement networking equipment from vendors and taking any other actions required to restore connectivity.
- Cloud-based Disaster Recovery
One of the most compelling reasons to include the cloud in your disaster recovery planning is the ability to use a cloud provider's data center as a recovery site without investing in additional facilities, systems and personnel. It also grants users access to cutting-edge IT capabilities, a consequence of a competitive cloud market in which AWS, Microsoft Azure, Google Cloud and others attempt to one-up each other.
There are several factors to consider before making the jump to disaster recovery as a service (DRaaS), including bandwidth, cloud storage costs, security and regulatory compliance, to name a few. As with any IT endeavor, identify the backup and recovery challenges that a third-party cloud provider may help solve along with the impact on your IT processes and budget before incorporating the cloud into your disaster recovery plan.
- Data Center Disaster Recovery
A data center disaster recovery plan extends well beyond the IT systems housed in a computing facility. It involves the building itself, utility providers, backup power, physical security, fire suppression, HVAC (heating, ventilation and air conditioning), support personnel, and much more.
Preparing for a data center outage, outright damage to a data center building, intruders and other risks is essential and often requires the input of your company's IT teams, facilities management personnel and physical security experts.
Now it's time to get started creating a DR strategy.
- Complete a risk assessment
A key step is to create a risk assessment that details the likelihood of a disaster and the risk it poses to your organization. This will help you prioritize your efforts. Although you may rarely face a hurricane or feel the earth tremor, there's a good chance that you'll experience a server failure or a cyber-attacker targeting your network. Plan accordingly.
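One common way to turn a risk assessment into a priority list is to score each threat by likelihood and impact, then sort by the product of the two. The sketch below assumes a 1-to-5 scale for both; the threat names and scores are illustrative placeholders, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def score(self) -> int:
        # Simple likelihood x impact product
        return self.likelihood * self.impact


def prioritize(risks: list[Risk]) -> list[Risk]:
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Illustrative values only
risks = [
    Risk("hurricane", likelihood=1, impact=5),
    Risk("server failure", likelihood=4, impact=3),
    Risk("cyber-attack", likelihood=4, impact=4),
]

for risk in prioritize(risks):
    print(f"{risk.name}: {risk.score}")
```

With these sample scores, the cyber-attack and server failure outrank the hurricane, which matches the advice to plan for the likely events first.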
- Collect data and organize a plan
Gather the information needed to create a disaster recovery plan. This may include an inventory of your servers and storage systems, network diagrams, data center blueprints and floorplans, key personnel, emergency contact numbers, backup and recovery procedures and workflows, third-party services, support information and more, depending on your specific setup.
Now comes the most crucial part: documenting it.
Assuming you haven't hired a disaster recovery consultant and are using in-house personnel to write a disaster recovery plan, it may be helpful to search out a disaster recovery template from an authoritative source. Even if you deviate from the template, it will acquaint you with how to use a methodical approach to problem solving, write for an intended skill level, avoid glaring omissions and provide an actionable, step-by-step guide of your own.
- Test your disaster recovery plan
After that's done, it's time for disaster recovery testing.
Like fire drills, where employees file out of an office building as if it were actually on fire, disaster recovery testing simulates an IT mishap. However, pulling the plug on a critical server or cranking up the heat in a data center is generally ill-advised. Luckily, there are other ways to determine if your plan will work.
Disaster recovery testing can involve tabletop tests where recovery procedures are discussed and evaluated without physically taking the actions described in the document. Businesses can also conduct hands-on technical tests where participants are tasked with restoring a system, helping them gauge their preparedness.
Finally, your disaster recovery plan should be a "living document," of sorts.
Routinely update it to account for changes to your infrastructure, technology updates, mergers and acquisitions and the many other factors that affect your IT environment. Be sure to update your testing procedures after significant changes.
And don't forget your most valuable resource: people.
Identify employees who will be put in charge when a crisis erupts and match skillsets to the affected systems and technologies. Remember to keep your employee information current. The best-laid plans will fall apart if your workers are left scrambling to find someone who can help.
Technology, though, is only one piece of the DR site puzzle. Successful recovery requires people, processes and governance, as well as a good testing culture to ensure recovery readiness. This should be dictated by an overall Business Continuity Plan (BCP) that covers areas such as incident assessment, crisis communications, management, and more.
“IT DR plans are a subset of a BCP and should include the strategy and steps/runbooks to ensure successful recovery,” said George. “Increasingly, runbooks should be orchestrating automated recovery steps and procedures to ensure higher success and reliability.”
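As a rough illustration of what an orchestrated runbook can look like, the sketch below runs recovery steps in order and stops at the first failure so that an operator can step in. The step names and the functions behind them are hypothetical placeholders; a real runbook would call your own backup, DNS and monitoring tooling.

```python
from typing import Callable

# A step is a (name, action) pair; the action returns True on success.
Step = tuple[str, Callable[[], bool]]


def run_runbook(steps: list[Step]) -> list[tuple[str, str]]:
    """Run steps in order and stop executing after the first failure.

    Returns a log of (step name, status) pairs, where status is
    "ok", "failed" or "skipped".
    """
    log: list[tuple[str, str]] = []
    failed = False
    for name, action in steps:
        if failed:
            log.append((name, "skipped"))
            continue
        try:
            ok = action()
        except Exception:
            # Treat any unexpected error as a failed step
            ok = False
        if ok:
            log.append((name, "ok"))
        else:
            log.append((name, "failed"))
            failed = True
    return log


# Hypothetical steps for illustration only
def restore_database() -> bool:
    return True


def start_app_servers() -> bool:
    return False  # simulate a failure


def switch_dns() -> bool:
    return True


runbook = [
    ("restore database", restore_database),
    ("start app servers", start_app_servers),
    ("switch DNS", switch_dns),
]

for name, status in run_runbook(runbook):
    print(f"{name}: {status}")
```

Stopping at the first failure is a design choice made here for safety; some teams may prefer retries or parallel steps, which this sketch does not attempt.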
Within the DR plan, it is vital to look beyond the data to encompass applications and their dependencies related to metadata, security settings, certificates, keys, configuration, and licensing. The last thing you want is to fail to recover due to neglect of these factors.
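One simple way to catch such gaps early is to compare what is actually protected for each application against a checklist of the items listed above. The following is a minimal sketch under that assumption; the application names and the inventory are hypothetical.

```python
# Artifact kinds named in the text, plus the data itself
REQUIRED_ARTIFACTS = {
    "data", "metadata", "security settings", "certificates",
    "keys", "configuration", "licensing",
}


def missing_artifacts(protected: dict[str, set[str]]) -> dict[str, list[str]]:
    """Map each application to the required artifacts its DR copy lacks."""
    gaps = {}
    for app, artifacts in protected.items():
        missing = sorted(REQUIRED_ARTIFACTS - artifacts)
        if missing:
            gaps[app] = missing
    return gaps


# Hypothetical inventory of what is currently protected per application
protected = {
    "billing": set(REQUIRED_ARTIFACTS),
    "intranet": {"data", "configuration", "metadata"},
}

print(missing_artifacts(protected))
```

Applications with a complete set are left out of the result, so an empty dictionary means nothing on the checklist has been overlooked.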
Staffing should also be part of the plan. Some have devised beautiful DR plans only to fail as staff resources couldn’t arrive at the disaster recovery site.
A good DR plan should include the following:
- An RTO and RPO assessment established by leadership to define the maximum downtime allowed and the maximum data loss that can be tolerated. Any DR plan should accomplish the intended RTO and RPO.
- A communication and notification plan, i.e., who needs to be notified and how (you can’t assume email is the best method since the email system could be impacted by the disaster).
- A roles and responsibilities plan defining who is responsible during the outage.
- A critical systems inventory that associates the IT systems that need to be protected with a DR plan.
- Special attention should be paid to legacy systems, for which additional hardware may be difficult to obtain on short notice.
- Security and compliance need to be factored into a DR site. A company should not have a reduced security posture while operating in DR mode. This could allow hackers or rogue agents to steal data or insert ransomware.
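To make the RTO and RPO item concrete, the sketch below checks a single recovery, whether a real incident or a test, against both objectives. It treats downtime as the time from outage to restored service and data loss as the time between the last good backup and the outage; the timestamps and thresholds are illustrative assumptions.

```python
from datetime import datetime, timedelta


def meets_objectives(
    outage_start: datetime,
    service_restored: datetime,
    last_good_backup: datetime,
    rto: timedelta,
    rpo: timedelta,
) -> dict[str, bool]:
    """Compare one recovery (real or test) against its RTO and RPO."""
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative numbers only
result = meets_objectives(
    outage_start=datetime(2024, 1, 10, 9, 0),
    service_restored=datetime(2024, 1, 10, 12, 30),
    last_good_backup=datetime(2024, 1, 10, 8, 45),
    rto=timedelta(hours=4),
    rpo=timedelta(minutes=10),
)
print(result)  # {'rto_met': True, 'rpo_met': False}
```

In this example service returns within the four-hour RTO, but the last backup is 15 minutes old against a 10-minute RPO, so the plan would need more frequent backups or replication.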
The old saying that practice makes perfect very much applies to DR planning and a BCP. A thorough test or dummy run of a disaster is a good way to flush out bugs in the process. This will bring to light factors such as faulty procedures, lack of documentation, incorrect configurations, and unassigned responsibilities.
“Regularly simulate various disaster and outage scenarios,” said George.
Greg Arnette, Technology Evangelist for Barracuda Networks, added that a common glitch in DR planning is out-of-date software. Therefore, any test should verify that all patches are current, and any needed updates have been applied.
“Outdated and stale software dormant in the DR systems will cause problems when it’s time for the DR systems to come online during a disaster,” said Arnette. “A disaster is the worst possible time to worry about applying accumulated patches.”
A "do once and be done" approach to testing is another way to fail. Personnel turnover and technology upgrades make any BCP obsolete within a year or so. Regular testing exposes weak points caused by personnel, technology and environmental changes. Plans become stale over time and need to be refreshed. Staff will come and go, so always keep your DR plan up to date.
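A lightweight way to keep testing from lapsing is to record when each part of the plan was last exercised and flag anything older than a chosen interval. The sketch below assumes a one-year threshold, echoing the "within a year or so" rule of thumb; the plan items and dates are hypothetical.

```python
from datetime import date, timedelta


def stale_items(
    last_tested: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=365),  # assumed review interval
) -> list[str]:
    """Return the plan items whose last test is older than max_age."""
    return sorted(
        name for name, tested in last_tested.items()
        if today - tested > max_age
    )


# Hypothetical test log
last_tested = {
    "database failover": date(2023, 3, 1),
    "network rebuild": date(2024, 2, 15),
}

print(stale_items(last_tested, today=date(2024, 6, 1)))
```

Here only the database failover would be flagged, since its last test is more than a year before the reference date.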