October 2001

This Is Not a Test

For those of us in the United States entrusted with a company's information resources, the events of September 11 changed everything. Before, our business continuity and disaster recovery plans were primarily concerned with so-called acts of God. Now we must plan for the most improbable human acts imaginable. Who among us, prior to September 11, had a plan that took into account multiple high-rise office buildings being destroyed within minutes of each other?

As you read this, the insurance industry is revising its assumptions. Likewise, we must reconsider our approach to managing and protecting the assets for which we are responsible. Never before has the probability of actually needing to execute our recovery plans been so great.

As of this writing there have already been numerous business continuity and disaster recovery articles in the computer press. By now we understand the distinction between keeping the business going (not just IT, but the whole business) and recovering after some, hopefully minor, interruption. And we've covered the issue of risk, where all the trade-offs and costs are negotiated. The whole topic has been explored anew in the last few months, but it is still worthwhile to emphasize some early lessons of the attacks, from which we are still recovering.

It Had Better Work

Worst Practice 1: Trying to Fake It

I was recently visiting a friend's datacenter, where I was told about an audit. My friend's company spent the whole time trying to fake its way through the audit criteria: disaster recovery preparedness, security, audit trails and so on. At the risk of sounding like your parents, whom does this behavior really hurt? An audit is an ideal opportunity to validate all the hard work required to run a professional datacenter, so that should you ever be subjected to an attack, electronic or otherwise, you know your datacenter will survive.

If you didn't get it before, you'd better get it now: faking it is unacceptable. Chances are, at some point you will be required to do a real, honest-to-goodness recovery. And if you think you're safe just because there may not be very many hijacked planes running into buildings such as yours, think again. The threats to your datacenter are diverse and numerous. And, by the way, violent weather, earthquakes and other natural disasters are still out there too.

Worst Practice 2: Not Testing

Once you're serious about continuity and recovery, you will not only plan, but also test that plan often. There are plenty of reasons to test your recovery capability regularly, among them: the ability to react quickly in a crisis; catching changes in your environment since your last test; and accommodating changes to your staff since your last test. A real recovery is a terrible time to do discovery.

Worst Practice 3: Not Documenting

One of the biggest problems with disasters is that they come with no warning. That's why so many tests are a waste of time: anyone can recover when they know exactly when and how. The truly prepared can recover when caught by surprise. Since you won't get any warning (except, perhaps, with some natural disasters), you'll want current, up-to-date procedures. And since you'll probably be on vacation (or wish you were) when disaster strikes, make sure the recovery procedures are kept off-site and available. If you're the only one who knows what to do, then even if you never take a day off, there still won't be enough of you to go around at crunch time.
Increasing the Odds of Recovery

Worst Practice 4: Taking Too Long

At this point in technology, there are two main ways to deal with a disaster: fail-over and reconstruction. With fail-over, you replicate data between your main site and a recovery site. These sites can be relatively near each other (across town or perhaps in an adjoining state) or far away. This kind of remote clustering, if you will, is what the largest and most critical institutions use, and the cost is considerable. However, the cost of not doing it is considerably more.

Reconstruction is more about recovery than continuity. I am guessing that the vast majority of e3000 shops base their recovery plans on recalling tapes from a vault (e.g., Iron Mountain) to a recovery site, then restoring their data either to a bare machine or to one on which only MPE has been installed. This was certainly true for my own operation, as my management always deemed this less expensive method adequate. But that was then. Today, the amount of data that must be reloaded is so massive that the time to recover can render this method all but worthless. True, your plan can call for a critical subset of data to be restored rather than the entire data warehouse. But even current data can now stretch into the terabytes once you include the applications, utilities and so on. So the point here is to make sure your recovery methodology is practical from a business standpoint as well as a technical standpoint. You don't want to be in the position of estimating "just three more days" before you're up and running.
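To see why tape-based reconstruction can fail the business test, it is worth doing a rough back-of-the-envelope calculation. The short Python sketch below is purely illustrative: the data volumes and sustained restore rates are assumptions chosen for the example, not benchmarks from any real site, so substitute figures from your own environment.

# Back-of-the-envelope restore-time estimates for a tape-based recovery plan.
# The data volumes and sustained restore rates are illustrative assumptions,
# not measurements from any particular site; plug in your own numbers.

def restore_hours(data_tb, rate_gb_per_hour):
    """Hours needed to reload data_tb terabytes at a sustained rate in GB/hour."""
    return (data_tb * 1024) / rate_gb_per_hour

scenarios = [
    ("Critical subset, fast tape library (0.5 TB at 100 GB/hr)", 0.5, 100),
    ("Full environment, fast tape library (3 TB at 100 GB/hr)", 3.0, 100),
    ("Full environment, single drive (3 TB at 25 GB/hr)", 3.0, 25),
]

for label, tb, rate in scenarios:
    hours = restore_hours(tb, rate)
    print(f"{label}: about {hours:.0f} hours ({hours / 24:.1f} days)")

Even under the generous assumptions, reloading a full multi-terabyte environment takes more than a day; under the modest ones it takes roughly five days. That is exactly the kind of estimate no business wants to hear.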
Worst Practice 5: Not Recovering a Complete Environment

As the state of the art advances, some technology is left behind. We'll keep it succinct here: if you need to keep an old technology alive, you may need to provide some or all of the recovery solution yourself. Don't expect the recovery site to stock or maintain every peripheral ever made just because you have one esoteric requirement. And don't forget to keep backup copies of any obsolete software packages as well.

Another aspect of this issue, recently discovered at a customer site, is the fact that diverse platforms are now highly integrated. It's not enough just to recover the e3000. The non-e3000 systems that share data feeds must also be recovered, and don't forget any outside data sources either. Again, if you're faking it, you can declare victory once you've reconstructed an e3000 at the recovery site. In reality, that only counts if the e3000 can support the business on its own, without any external feeds.

Worst Practice 6: Ignoring the Human Factor

Even the best plans don't execute themselves. Keep in mind who will be doing what, and how things will get done if key individuals are unable to perform their tasks. As we know, families come first, as is proper, so we mustn't lose sight of our humanity in times of crisis. Any recovery is hard work; that counts double when there are casualties.

Reassess Your Assumptions

Worst Practice 7: A Defeatist Attitude

If you've been subjected to the "fake it" mentality, you're probably demoralized. After all, who among us wants to just go through the motions? Well, it's now a whole new world, and you have a really good shot at doing things right. But you need to make your case forcefully to those who didn't take contingency planning seriously in the past. By the time you read this there may be stories about companies that, unfortunately, couldn't recover from the September 11 attacks. We can emerge from this atrocity stronger if we do some honest introspection.

Every rational businessperson should now be willing to do proper planning. If you can leave the bad practices of the past behind, you can position yourself and your business to be survivors.

Worst Practice 8: Datacenter Placement

As much as I enjoyed the view from my 29th-floor datacenter, it's pretty obvious now that datacenters don't belong in certain places, high-rise buildings among them. Besides the prohibitive cost of floor space, there are safety and security issues that weren't obvious until recent events. I have visited many co-location facilities in the past year, and they all had several things in common:

1. They were in the low-rent district.
2. They were very difficult to find, as they were essentially unmarked.
3. They were very secure (at least relative to downtown datacenters), both physically and electronically.
4. They were redundant up the wazoo.

If this does not describe your datacenter, then perhaps it's time to consider relocation. Let's face it: even if there are good reasons why your datacenter needs to be right downtown, I'll bet your recovery site is in the middle of nowhere. That should tell you something.

Hope for the Best

We're currently in reactive mode. We've now seen one type of unimaginable act: airliners used as missiles. For those unlucky enough to be on the front lines of that atrocity, there was no way to plan for that series of events. And it's likely that the next event will also be difficult to imagine, and hence to plan for. So even the best plans require a great deal of luck; the finest plan is useless if there is widespread devastation beyond your control.

We should be honest about those aspects of business continuity and recovery that are within our control, and we must be truly prepared. But we can still hope that we never need to actually use those plans, not the way we did after September 11. At least that's the hope.

Scott Hirsh (scott@acellc.com), former chairman of the SYSMAN Special Interest Group, is an HP Certified HP 3000 System Manager and founder of Automated Computing Environments (925.962.0346), HP-certified OpenView consultants who consult on OpenView, Maestro, Sys*Admiral and other HP e3000 and HP 9000 automation and administration practices.

Copyright The 3000 NewsWire. All rights reserved.