I’ve just spent 3 days looking at the output files produced when a Lotus Server (or client) crashes or hangs. It got me thinking about how IT departments manage service restoration and what they could learn from other organisations.
Back in 1995 (eek, I feel old now) I attended a long weekend course in Emergency Planning and Crisis Management as part of my voluntary work. The course was run by the UK Home Office Emergency Planning College (yes, there is such a thing!). It taught the principles of emergency planning. More and more I’m seeing how these principles should be transferred into the IT sector.
The principle in Emergency Planning is simple: if you can’t manage the incident with your normal resources in that locality, then it is a major incident. So in the IT context, if you can’t maintain normal service to your clients with the team you have in place and need to call in expertise, then let’s class that as a “major incident”.
The important thing with any incident is command and control with an associated communications plan. So often I see outages where one engineer is working on resolution but is constantly pulled away from the task to talk to his manager, then the service delivery manager, then the account manager, then the project manager … and on and on.
The Emergency Planners spotted this problem many, many years ago! Their solution is a simple three-tiered command and control structure with:
– Gold Control
– Silver Control
– Bronze Control
Gold Control is the Strategic Level. These folks sit well away from the incident (in the Police sense this would be the Force Headquarters) and have overall control. They look after getting extra resources, generally look further ahead and really concentrate on the next 24-48 hours.
In the IT sense this would be your technical managers, service delivery managers, account managers and project managers. They only talk to Silver! They worry about what happens after service restoration, communications with users, contingency plans if service isn’t restored etc.
Silver Control in the police sense would be the local division headquarters. They worry about relieving staff in the next 12 hours, escalating issues from Bronze up to Gold and feeding relevant information from Gold down to Bronze. They act as a buffer.
These folks in the IT sense would be the local team leads with staff working on the problem. They have enough technical knowledge to make decisions for the next few hours in terms of what direction to proceed in and which resources do what. They don’t continually annoy Bronze for information and actively protect them from interference.
Bronze Control in the police sense would be the officer in charge at the incident site. They worry about what is happening now and in the next few hours and how to manage the incident right now, not looking too far into the future.
In the IT sense this is your senior engineers who are part of a service restoration team. They shouldn’t be annoyed directly by management and should only have communication with the Silver team. They are the “get on with it, get it fixed and worry about the next few actions” team.
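For anyone who thinks better in code, the communication rules described above (Gold only talks to Silver, Bronze only talks to Silver) can be sketched roughly like this. It is purely an illustration: the names `Tier`, `ALLOWED_CHANNELS`, `may_contact` and `route` are my own inventions rather than part of any emergency planning standard, and I have assumed that people within the same tier can always talk to each other.

```python
from enum import Enum
from typing import List


class Tier(Enum):
    GOLD = "gold"      # strategic: managers, looking 24-48 hours ahead
    SILVER = "silver"  # tactical: team leads, acting as the buffer
    BRONZE = "bronze"  # operational: engineers fixing the problem


# Direct channels between tiers. Silver is the only tier that talks
# to both of the others; there is no Gold <-> Bronze channel.
ALLOWED_CHANNELS = {
    frozenset({Tier.GOLD, Tier.SILVER}),
    frozenset({Tier.SILVER, Tier.BRONZE}),
}


def may_contact(sender: Tier, recipient: Tier) -> bool:
    """Return True if sender may talk to recipient directly."""
    if sender == recipient:
        # Assumption: people within one tier can always talk to each other.
        return True
    return frozenset({sender, recipient}) in ALLOWED_CHANNELS


def route(sender: Tier, recipient: Tier) -> List[Tier]:
    """Return the chain of tiers a message passes through."""
    if may_contact(sender, recipient):
        return [sender, recipient]
    # The only blocked pair is Gold <-> Bronze, which is relayed via Silver.
    return [sender, Tier.SILVER, recipient]
```

So a manager in Gold who wants a status update from an engineer gets `route(Tier.GOLD, Tier.BRONZE)`, which passes through Silver rather than landing on the engineer’s desk.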
In my view, the way forward in this area is to implement some of this knowledge and share best practice from other areas where incident management is the KEY role of the organisations involved.