Anyone who’s worked in IT for even a short while knows the scenario all too well. A flicker, and then blank – the lights are out, the phones are dead, and the squeal of UPS alarms across the building is almost deafening. A power outage tests how quickly you can get systems cleanly shut down or, in some cases, get a backup power source up and running. Often you’ll discover that some of your UPS batteries just aren’t as strong as they need to be, and systems go down the hard way.
The real fun begins when power is restored and systems need to come back online. After a “dirty” shutdown there can be any number of problems that need your immediate attention, including checking for systems that are still down, checking on services that didn’t start, finding errors which need to be reviewed/resolved, and time syncing systems again to name just a few. Add to this a flurry of questions from bosses and coworkers wondering when they can get back online or when an application will be available again and you’ve got a recipe for a nice stress headache.
Recently TNT Software, along with about half of our downtown area, experienced a mid-day power outage that put our UPSs to the test – and unfortunately a few older, obscure ones on some non-critical systems failed within just a minute or two. After the rest of the systems were shut down properly, the battle of iPhone vs Android began among employees to see who could be the fastest to find a news feed describing what the problem was and when the city would have power restored.
In our environment most of the systems’ BIOS are set to start up automatically when power is restored. So the first thing our IT Manager did was verify that all the systems expected to come back online actually did. This can be accomplished both by visually checking the racks and by PINGing. However, think about how long it could take to PING all the critical systems on your network. This can be a painstaking task without a tool to assist. Sure, you could use a script that reads a file of all your system names, runs a PING against each, and reports back, but that list of names had better be up to date.
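For comparison, the do-it-yourself script mentioned above might look something like the following. This is only a minimal sketch and not part of ELM: the function names are hypothetical, and it assumes the system `ping` command is on the path and accepts the common Windows or Linux flags.

```python
import platform
import subprocess


def load_hosts(text):
    """Return host names from the file text, skipping blanks and # comments."""
    hosts = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name:
            hosts.append(name)
    return hosts


def ping_command(host, system=None):
    """Build a single-echo ping command for the given (or current) OS."""
    system = system or platform.system()
    if system == "Windows":
        # -n 1: one echo request, -w 1000: wait up to 1000 ms for a reply
        return ["ping", "-n", "1", "-w", "1000", host]
    # Linux-style flags: -c 1: one echo request, -W 1: wait up to 1 second
    return ["ping", "-c", "1", "-W", "1", host]


def is_up(host):
    """Treat exit code 0 as 'replied'. On Windows, a 'destination host
    unreachable' reply can also exit 0, so this is only a rough check."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_down(hosts, probe=is_up):
    """Return the hosts that did not answer the probe."""
    return [host for host in hosts if not probe(host)]
```

To use it, you would read your own host list (say, a hypothetical `critical_hosts.txt` with one name per line), pass its contents to `load_hosts`, and print whatever `find_down` returns. The script is only as good as that file, which is exactly the maintenance burden described above.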
Instead of having to manually PING, or worse, RDP into each system, you can use the PING Monitor included with ELM Enterprise Manager licenses. The PING Monitor sends custom ICMP echo requests to verify TCP/IP connectivity and the Quality of Service. It provides an early warning alert of a problem with the remote system’s status. So if any systems fail to come back online, ELM will be the first to let you know.
Within the ELM Console, the ELM Server At-a-Glance View provides a section at the top with a table showing systems where PING has failed. This is a great ‘short list’ to work from, and as systems come back on-line, and the PING Monitor reports a success state, the system is automatically removed from the failure list.
Next it was time to verify that business-critical services were working.
- Email coming in and going out?
- Able to get on the Internet?
- Core systems accessible and working properly?
Often this is handled by a checklist you go through, or sometimes case by case depending on which end user squawks the loudest about what they need that isn’t working. Bouncing around and getting into each system to check services and verify that business processes are running smoothly can be a very time-consuming process.
ELM Enterprise Manager Core and System licenses also include a Windows Service Monitor. This Service Monitor detects and responds to service and device state changes; specifically, Starting, Started, Paused, Stopping and Stopped. It is fully customizable to include or exclude any services your servers may be running. This Monitor is commonly used with the Command Script notification, automatically launching a batch file to restart a failed service. This empowers administrators to combine proactive monitoring with automated corrective action. When you are recovering from a power outage, the Windows Service Monitor in ELM combined with Command Script restarts is worth its weight in gold.
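To illustrate the kind of corrective action such a restart script performs, here is a rough sketch of a check-and-restart step. It is not ELM’s own implementation: the function names are hypothetical, and it assumes the standard Windows `sc query` and `sc start` commands are available and that `sc query` prints its usual `STATE : <number> <name>` line.

```python
import re
import subprocess

# Matches a line such as "        STATE              : 1  STOPPED"
STATE_RE = re.compile(r"^\s*STATE\s*:\s*\d+\s+(\w+)", re.MULTILINE)


def parse_state(sc_output):
    """Extract the state name (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = STATE_RE.search(sc_output)
    return match.group(1) if match else None


def query_state(service):
    """Ask Windows for the current state of one service."""
    result = subprocess.run(
        ["sc", "query", service], capture_output=True, text=True
    )
    return parse_state(result.stdout)


def start_service(service):
    """Issue a start request; the service may still take time to come up."""
    subprocess.run(["sc", "start", service], check=False)


def restart_if_stopped(service, query=query_state, start=start_service):
    """Start the service if it reports STOPPED.

    Returns True if a start request was issued, False otherwise.
    """
    if query(service) == "STOPPED":
        start(service)
        return True
    return False
```

Calling `restart_if_stopped("Spooler")` on a Windows server would issue a start request only if the Print Spooler is currently stopped. Passing `query` and `start` as parameters keeps the decision logic testable without touching real services.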
As the excitement starts to die down and order is close to restored, it is time to turn our attention to the event logs, specifically the Errors, to see what other problems need to be addressed.
ELM Enterprise Manager comes with several preconfigured Event Views with filtering in place to show you the event data you need without having to hunt it down machine by machine or create custom reports.
Each Event View can be toggled back and forth between Detail and Summary View for quick analysis and investigation.
Finally, we like to have the system time in sync across all of our servers. They are set to retrieve the time from the Domain Controller; however, after a shutdown, clean or dirty, the time sync can sometimes just get “messed up.”
Rather than manually running a batch file on each system to resync the time, ELM Enterprise Manager’s Event Alarm can automate the job so you don’t have to think twice about it. In our case, the Event Alarm is set up to look for the event in the System log that is generated when an accurate time source cannot be found. For example, here’s what a log entry might look like:
The time provider NtpClient is configured to acquire time from one or more time sources, however none of the sources are currently accessible.
No attempt to contact a source will be made for 15 minutes.
NtpClient has no source of accurate time.
------------------------------
11/10/2010 - 2:51:09 PM
Computer: tntdomain1
Type: Error
Event 29
Username: None
Source: W32Time
Category: None
Log: System
When this event is detected, the Event Alarm’s Action is set to run another script to automatically resync. Which script do you run? Our friends at Microsoft provide a knowledge base article that you can copy and paste into the Command Script notification of ELM to execute. You can find it here – http://support.microsoft.com/kb/875424. With an Event Alarm Monitor configured to look for the event above and take the action of a Run Command, any instance of the time being out of sync will trigger the execution of this script to fix it “automagically” for you.
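The detect-then-resync logic can be sketched as follows. This is an illustration only, not the script from the Microsoft article and not ELM’s internals: the function names are hypothetical, it assumes the event’s Source and Event ID are available to the script, and it uses the standard Windows `w32tm /resync` command as a stand-in for whatever resync commands you choose to run.

```python
import subprocess

# Values taken from the sample log entry: Source W32Time, Event 29
TIME_SOURCE = "W32Time"
NO_TIME_SOURCE_EVENT = 29


def needs_resync(source, event_id):
    """True when the event is the W32Time 'no source of accurate time' error."""
    return source == TIME_SOURCE and event_id == NO_TIME_SOURCE_EVENT


def resync_command():
    """Command used here as a stand-in for the resync script (an assumption)."""
    return ["w32tm", "/resync"]


def handle_event(source, event_id, run=subprocess.run):
    """Run the resync command if the event calls for it.

    Returns True if the command was launched, False if the event was ignored.
    """
    if needs_resync(source, event_id):
        run(resync_command(), check=False)
        return True
    return False
```

With this sketch, `handle_event("W32Time", 29)` would launch the resync on a Windows system, while any other source or event ID is ignored.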
So there you have it. After a power outage ELM is right there to not only help identify the problem areas but also to cut through some of the busywork and get you back running at peak performance in no time at all.
The power outage at TNT Software described in the story above happened mid-afternoon on a Thursday. Now imagine that you just arrived at work on a Monday morning and your outage occurred sometime over the weekend. If your mobile phone didn’t clue you in before you got to work, you can bet the bosses are quite anxious and already pacing the floor when you walk through the door.
Let the cleanup begin – or just finish it up with ELM in your arsenal.
We hope that you found this article on Getting Back to Normal After a Power Outage With the Help of ELM useful and wish you continued success with ELM.