Disaster recovery plan development best practices or resources?

Solution 1:

Make sure you have an emergency contact roster, a.k.a. a Recall Roster.

It should look like a tree and show who contacts whom. The last person at the end of each branch should call the first person back and report anyone who could not be reached.

(This can be co-ordinated through HR, and used for any type of disaster)
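
To make the tree structure concrete, here is a minimal sketch of a recall roster modelled as nested nodes, with a helper that prints each calling branch and who reports back to the top; all names, numbers, and the layout are illustrative assumptions, not a prescribed format:

```python
# Illustrative recall roster as a tree; names and numbers are made up --
# the real roster would come from HR.
recall_roster = {
    "name": "IT Director", "phone": "555-0100", "calls": [
        {"name": "Ops Manager", "phone": "555-0101", "calls": [
            {"name": "Sysadmin A", "phone": "555-0102", "calls": []},
            {"name": "Sysadmin B", "phone": "555-0103", "calls": []},
        ]},
        {"name": "Network Lead", "phone": "555-0104", "calls": [
            {"name": "Network Engineer", "phone": "555-0105", "calls": []},
        ]},
    ],
}

def print_branches(node, chain=None):
    """Print each branch so the last person on it knows to call the
    first person back and report anyone who could not be reached."""
    chain = (chain or []) + [node["name"]]
    if not node["calls"]:
        print(" -> ".join(chain) + f"  (reports back to {chain[0]})")
    for child in node["calls"]:
        print_branches(child, chain)

print_branches(recall_roster)
```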

Solution 2:

An excellent source of information is the Disaster Recovery Journal.

Community resources available include the current draft of their Generally Accepted Practices (GAP) document, which provides an excellent outline of the process and deliverables that constitute a solid business continuity plan and process. Also available are several white papers covering various DR/BC topics.

The process seems daunting, but if approached systematically with a good outline of where you would like to end up (like the DRJ GAP document), you can ensure that you optimize the time invested and maximize the value of the end product.

I find their quarterly publication to be interesting and informative as well.


Solution 3:

If we each contribute, we could turn this post into a nice wiki once everyone has added their own ideas. I understand there are a bunch of guides out there to follow, but some of us have specific priorities when it comes to recovery. To start, here's mine:

Make sure you have off-line/remote documentation of your network
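
As one way to keep that documentation current offsite, here is a minimal sketch that archives a local documentation directory and copies it to a remote mount; the "netdocs" directory and the mount path are assumptions for illustration, not part of the original suggestion:

```python
# Sketch: snapshot local network documentation to an offsite location.
import shutil
from datetime import date
from pathlib import Path

docs_dir = Path("netdocs")             # local documentation directory (assumed layout)
offsite = Path("/mnt/offsite-backup")  # offsite/remote mount point (assumed)

# Zip the docs with today's date in the name, then copy the archive offsite.
archive = shutil.make_archive(f"netdocs-{date.today()}", "zip", docs_dir)
shutil.copy2(archive, offsite / Path(archive).name)
print(f"Copied {archive} to {offsite}")
```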


Solution 4:

With DR the basic things are your RTOs (Recovery Time Objectives) and RPOs (Recovery Point Objectives), which roughly translate as "how long is it acceptable to spend getting it back, and how much data can we afford to lose". In an ideal world the answers would be "none and none", but a DR scenario is an exceptional circumstance. These really should be driven by your customers; since you're starting from the IT angle you can make best guesses, but be prepared to adjust up or down as required. Aiming for as close to "none and none" as you can reasonably get is good, but you'll need to be able to recognise when you hit the point of diminishing returns.

These two factors might be different at different times of the year, and different on different systems.
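
As a rough illustration of how those objectives drive the technical plan, here is a minimal sketch that checks a backup schedule and an estimated restore time against per-system RPO/RTO targets; the systems, intervals, and targets are made-up numbers, not recommendations:

```python
# Sketch: does the current backup/restore setup meet the agreed RPO/RTO?
from datetime import timedelta

systems = {
    # system: (backup interval, estimated restore time, RPO target, RTO target)
    "email":     (timedelta(hours=1),    timedelta(hours=4), timedelta(hours=2),    timedelta(hours=8)),
    "orders-db": (timedelta(minutes=15), timedelta(hours=6), timedelta(minutes=30), timedelta(hours=4)),
}

for name, (backup_interval, restore_time, rpo, rto) in systems.items():
    # Worst-case data loss is one full backup interval (disaster strikes
    # just before the next backup would have run).
    rpo_ok = backup_interval <= rpo
    rto_ok = restore_time <= rto
    print(f"{name}: RPO {'OK' if rpo_ok else 'MISSED'}, RTO {'OK' if rto_ok else 'MISSED'}")
```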

I like the more well-rounded approach; it's tempting to list out the events that can lead to a DR scenario, but these really belong more to a risk analysis/mitigation exercise. With DR the incident has already happened, and the specifics of what it was are less relevant (except perhaps in terms of affecting availability of DR facilities). If you lose a server you need to get it back, irrespective of whether it was hit by lightning, accidentally formatted, or whatever. An approach focussed around the scale and spread of the disaster is more likely to yield results.

One approach to use on customers, if you find that they're reluctant to get involved, is to ask them DR questions from a non-IT angle. Asking what their plans are if all their paper files go up in flames is an example. This can help get them more involved in the broader DR effort and can feed useful information into your own plans.

Finally, testing your plan regularly is crucial to success. It's no good having a beautiful DR plan that looks great on paper but doesn't meet its objectives.


Solution 5:

Actually, the "single incident" development model is a good idea, as a first step. One reason is that it makes the planning exercise more realistic and focused. Plan for the flood, all the way. Then suppose a different incident (say, a long-term power outage), apply that plan to it, and fix what breaks. After a few iterations, the plan should be relatively robust.

Some thoughts:

- Be sure to account for unavailable people. If there is a flood, you can't assume that all relevant staff are available. Someone might be on vacation, injured, or dealing with their family.
- Plan for communication problems and weaknesses. Have multiple numbers and multiple modes of contact.
- The DR plan needs a chain of command. Knowing who makes decisions is critical.
- The plan needs to be widely distributed, including offsite and off the grid. It needs to be accessible during the disaster!