Disaster Recovery and Continuity for the Database Administrator
The most important information in most businesses can be found in the database. A lot of time and attention goes into planning for any new database application. Storage, servers, high availability, capacity, and clustering are just some of the considerations.
The same planning process must take place for disaster recovery and business continuity planning of databases. All actions taken to make business-critical applications available must be methodical and deliberate. Disruptions are serious events and should not be taken lightly. "It's not about seeing the [recovered] data on your screen, but conducting business," as a 2006 Journal of Financial Planning article put it. Databases at the heart of the business fall squarely on the critical path of the disaster recovery actions taken when a disruption strikes.
Partial or complete disruptions of a business can be devastating. Business continuity planning can ensure that capacity is available for critical business operations in the time of need. Practiced professionals in the area of business continuity understand that life and opportunities can continue after a disaster. Knowing which steps will keep the business viable, however, takes planning.
Destruction of assets can be devastating. Insurance may cover the expense to replace those assets, but it will not put a business back in place overnight. Rebuilding takes a huge mental and physical toll on workers, and it puts stress on both employees and their customers. Without a disaster recovery plan in place, there is little hope of ever getting a business back on its feet.
One of the first things needed is the set of requirements for each database supported. Recovery times are probably the most important of these requirements. The difference between a few seconds of downtime and a few minutes of downtime can be quite substantial. Some business units may have a tolerance for a few hours. This must be known for each database for your plan to be effective. "...[Y]ou have to prioritize what you need in order to function... you have to figure out what is actually mission-critical," wrote Charlene O'Hanlon in a 2007 T.H.E. Journal article.
The second requirement to pin down is how much data loss is acceptable. If little to no data loss is acceptable, then the cost of the disaster recovery solution can become a budgetary concern. If restoring last night's backup will suffice, then this can lead to major cost savings.
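The two requirements above, tolerable downtime and tolerable data loss, can be recorded per database and used to set priorities. The following is a minimal sketch in Python, not a prescribed method: the class name, field names, database names, and figures are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryRequirement:
    """Recovery requirements gathered from the business for one database (hypothetical shape)."""
    database: str
    max_downtime: timedelta   # how long the business can function without this database
    max_data_loss: timedelta  # how much recent data the business can afford to lose


def nightly_backup_suffices(req, backup_interval=timedelta(hours=24)):
    """In the worst case a failure strikes just before the next backup,
    so the data lost equals the full interval between backups."""
    return req.max_data_loss >= backup_interval


def recovery_order(reqs):
    """Databases with the least tolerance for downtime are recovered first."""
    return sorted(reqs, key=lambda r: r.max_downtime)


# Illustrative figures only; real values come from the business units.
requirements = [
    RecoveryRequirement("reporting", timedelta(hours=8), timedelta(hours=24)),
    RecoveryRequirement("orders", timedelta(minutes=5), timedelta(0)),
    RecoveryRequirement("hr", timedelta(hours=2), timedelta(hours=1)),
]

for req in recovery_order(requirements):
    plan = "nightly backup" if nightly_backup_suffices(req) else "needs more than a nightly backup"
    print(f"{req.database}: {plan}")
```

Databases that fail the nightly-backup check are the ones likely to drive the cost of the solution, so they are worth identifying early.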
Capacity can be a concern at the disaster recovery site. Customers should be asked about performance degradation and what is acceptable. This can be a tricky question to answer, and customers will usually need assistance to figure it out. If left to themselves, they will almost always answer that no degradation is acceptable.
Along with acceptable performance degradation, ask how many users will need to access the system during the disruption. Together, these two answers allow a more accurate capacity estimate. Explain to customers that during the disruption, the entire corporate population may not need access to the enterprise application. Possibly only power users will need the system, to run business-critical functions for the enterprise.
One example is Human Resources applications. An HR application may be available to the corporate population during normal operations for viewing pay stubs, updating W-2s, and so on. During a disruptive event, these rights could be suspended but power users could continue to run payrolls, enter benefits, hire and fire employees, and the like. It is possible that far less capacity is needed than originally thought necessary, which can mean more databases on the same servers, as long as the databases will not interfere with one another's processing. Virtual servers can be used as well.
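The capacity reasoning above can be turned into rough arithmetic. The sketch below assumes, as a deliberate simplification, that load scales linearly with the number of users and that an accepted slowdown reduces the capacity needed in proportion. Real workloads rarely behave this neatly, so treat the result as a starting point for discussion rather than a sizing rule. The function name, parameters, and figures are hypothetical.

```python
import math


def estimate_dr_cores(normal_cores, normal_users, dr_users, acceptable_slowdown=1.0):
    """Rough CPU core estimate for the disaster recovery site.

    acceptable_slowdown is a factor >= 1.0; for example, 1.25 means the
    business accepts the system running about 25% slower than normal.
    ASSUMPTION: load is proportional to user count (a simplification).
    """
    if normal_cores <= 0 or normal_users <= 0:
        raise ValueError("normal_cores and normal_users must be positive")
    if dr_users < 0 or acceptable_slowdown < 1.0:
        raise ValueError("dr_users must be >= 0 and acceptable_slowdown >= 1.0")
    scaled = normal_cores * (dr_users / normal_users)
    # Round up, and never size below a single core.
    return max(1, math.ceil(scaled / acceptable_slowdown))


# Illustrative figures: 4,000 users on 32 cores in normal operations,
# 500 power users during a disruption, 25% slowdown accepted.
print(estimate_dr_cores(32, 4000, 500, acceptable_slowdown=1.25))
```

Running the same estimate with the full user population and no accepted slowdown returns the original core count, which is the answer customers tend to give when left to themselves.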
"... [Y]ou would re-instantiate the virtual machines at a higher ratio (density) of virtual-to-physical. Consequently, organizations that can tolerate a slight drop in performance can build a much cheaper secondary data center to handle temporary disruptions," according to a Nemertes Research report by Andreas Antonopoulos.
Accessing the databases and applications is another important matter. If the primary place of employment is no longer habitable, employees will need a place to go for office space and workstations. Workstations will need to be equipped with necessary software for database connections. This important point must not be overlooked.
Testing is very important. Determine how often you will need to test your disaster recovery plans. Only through testing of the plan can issues and problems be discovered and corrected. Testing can also bring opportunities to make improvements to the disaster recovery plan.
Nothing stays the same in business for very long, and the same is true of disaster recovery plans. To keep them relevant and up-to-date, testing must become a regular occurrence. Testing may occur yearly, twice per year, or quarterly. The more practical experience individuals can get with the disaster recovery plan and the disaster recovery site, the better off everyone will be during a crisis situation. Familiarity will build confidence in individuals and the equipment and systems they are working on.
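Once a testing frequency is chosen, it helps to track whether each plan is actually being tested on schedule. A minimal sketch, assuming the three frequencies mentioned above and approximate day counts for each; the names and dates are hypothetical.

```python
from datetime import date, timedelta

# Approximate intervals for the testing frequencies discussed above (assumed values).
TEST_INTERVALS = {
    "quarterly": timedelta(days=91),
    "twice_yearly": timedelta(days=182),
    "yearly": timedelta(days=365),
}


def is_test_overdue(last_test, frequency, today):
    """Return True if more time has passed since the last test than the chosen interval allows."""
    return today - last_test > TEST_INTERVALS[frequency]


# Illustrative dates only.
print(is_test_overdue(date(2024, 1, 15), "quarterly", date(2024, 6, 1)))
print(is_test_overdue(date(2024, 1, 15), "yearly", date(2024, 6, 1)))
```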
Usually, disaster recovery setup is not an emergency. The emergency only comes during execution of the plan. Still, a timeline should be put in place when planning disaster recovery for databases. Unfortunately, other projects often push disaster recovery to the back burner. Make disaster recovery part of all projects so that it can be completed in a timely manner.
Moving back to the primary site will be a joyful time. It can also be quite hectic, since it needs to be done quickly. No one wants to stay at the disaster recovery site any longer than they have to. Plan the return much as would be done with the go-live of a new application. Plan the downtime, migrations, testing, go/no-go decision and fallback procedures. Everything should be scheduled and users made fully aware of the outages and changeover schedules.
There should be someone, or some people, in the organization who will make the decision that a disaster has struck and failover should now take place. Determine who that person is and how the information will be communicated. Ideally, the information will be distributed in multiple forms. Rarely in a disaster will all the normal lines of communication be available to an organization.
Part 2 of this series can be found here.
Kevin Medlin has been administering, supporting, and developing in a variety of industries, including energy, retail, insurance and government, since 1997. He is currently a DBA supporting Oracle and SQL Server, and is Oracle certified in versions 8 through 10g. He received his graduate certificate in Storage Area Networks from Regis University and he will be completing his MS in Technology Systems from East Carolina University in 2008.
Article courtesy of Enterprise IT Planet