
hardware - Something is burning in the server room; how can I quickly identify what it is? - Server Fault


URL:http://serverfault.com/questions/496139/something-is-burning-in-the-server-room-how-can-i-quickly-identify-what-it-is/


The general consensus seems to be that the answer to your question comes in two parts:

How do we find the source of the funny burning smell?

You've got the "How" pretty well nailed down:

  • The "Sniff Test"
  • Look for visible smoke/haze
  • Walk the room with a thermal (IR) camera to find hot spots
  • Check monitoring and device panels for alerts

You can improve your chances of finding the problem quickly in a number of ways - improved monitoring is often the easiest. Some questions to ask:

  • Do you get temperature and other health alerts from your equipment?
  • Are your UPS systems reporting faults to your monitoring system?
  • Do you get current-draw alarms from your power distribution equipment?
  • Are the room smoke detectors reporting to the monitoring system? (and can they?)
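If those signals do reach a monitoring system, ranking them can be as simple as comparing each reading against a limit. The following is a minimal sketch of that idea, not a real monitoring integration: the `Reading` type, the sensor names, and every threshold value are hypothetical and would need to come from your own equipment specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    source: str   # hypothetical device name, e.g. "rack12-inlet"
    kind: str     # one of the keys in THRESHOLDS
    value: float  # measured value, or 1.0/0.0 for fault-style flags


# Hypothetical limits for illustration only; use your vendor's figures.
THRESHOLDS = {
    "temperature_c": 35.0,  # inlet temperature, degrees Celsius
    "current_a": 16.0,      # per-circuit current draw, amperes
    "ups_fault": 0.5,       # flag: 1.0 means the UPS reports a fault
    "smoke": 0.5,           # flag: 1.0 means a detector has tripped
}


def find_alerts(readings):
    """Return readings over their limit, furthest over the limit first."""
    alerts = [
        r for r in readings
        if r.kind in THRESHOLDS and r.value > THRESHOLDS[r.kind]
    ]
    # Rank by the ratio of value to limit so the worst offender is on top.
    alerts.sort(key=lambda r: r.value / THRESHOLDS[r.kind], reverse=True)
    return alerts
```

A list like this only tells you where to start looking; it does not replace walking the room or the safety rules below.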

When should we troubleshoot versus hitting the Big Red Switch?

This is a more interesting question.
Hitting the big red switch can cost your company a huge amount of money in a hurry: clean-agent releases can run into the tens of thousands of dollars, and the outage/recovery costs after an emergency power off (EPO, "dropping the room") can be devastating.
You do not want to drop a datacenter because a capacitor in a power supply popped and made the room smell.

Conversely, a fire in a server room can cost your company its data/equipment, and more importantly your staff's lives.
Troubleshooting "that funny burning smell" should never take precedence over safety, so it's important to have some clear rules about troubleshooting "pre-fire" conditions.

The guidelines that follow are the personal limits I apply in the absence of (or in addition to) any other clearly defined procedure or rules. They've served me well and they may help you, but they could just as easily get me killed or fired tomorrow, so apply them at your own risk.

#1: If you see smoke or fire, drop the room
This should go without saying, but let's say it anyway: if there is an active fire (or smoke indicating that there soon will be), you evacuate the room, cut the power, and discharge the fire suppression system.
Exceptions may exist (exercise some common sense), but this is almost always the correct action.

#2: If you're proceeding to troubleshoot, always have at least one other person involved
This is for two reasons. First, you do not want to be wandering around a datacenter when a rack goes up in the row you're walking down and nobody knows you're there. Second, the other person is your sanity check on troubleshooting versus dropping the room. Should you make the call to hit the Big Red Switch, you have the benefit of a second person concurring with the decision, which helps avoid the career-limiting aspects of such a decision if someone questions it later.

#3: Exercise prudent safety measures while troubleshooting
Make sure you always have an escape path (an open end of a row and a clear path to an exit).
Keep someone stationed at the EPO / fire suppression release.
Carry a fire extinguisher with you (Halon or other clean-agent, please).
Remember rule #1 above.
When in doubt, leave the room.

#4: Set a limit and stick to it
More accurately, set two limits:
Condition ("How much worse will I let this get?"), and
Time ("How long will I keep trying to find the problem before it's too risky?").
The limits you set can also be used to let your team begin an orderly shutdown of the affected area, so that when you DO pull power you're not crashing a bunch of active machines and your recovery time will be much shorter. Remember, though, that if the orderly shutdown is taking too long you may have to let a few systems crash in the name of safety.
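The two limits can be written down before anyone walks into the room. Here is a minimal sketch, assuming the "condition" can be reduced to a single number such as the hottest temperature seen so far; the function name, the 10-minute budget, and the 60 °C ceiling are hypothetical placeholders, not recommendations.

```python
import time


def should_stop(started_at, now, worst_temp_c, *,
                time_limit_s=600.0, temp_limit_c=60.0):
    """Return a reason to stop troubleshooting, or None to keep going.

    started_at and now are timestamps in seconds (e.g. time.monotonic()).
    worst_temp_c is the hottest reading observed so far (assumed metric).
    """
    # Condition limit: "How much worse will I let this get?"
    if worst_temp_c >= temp_limit_c:
        return "condition limit reached"
    # Time limit: "How long will I keep trying to find the problem?"
    if now - started_at >= time_limit_s:
        return "time limit reached"
    return None


if __name__ == "__main__":
    start = time.monotonic()
    # A check made right away, with a hypothetical 42 C hot spot.
    print(should_stop(start, time.monotonic(), 42.0))
```

Agreeing on the numbers up front means the person at the EPO is checking against a limit you both set, not debating it in the moment. Rules #1 and #5 still override anything a timer says.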

#5: Trust your gut
If you are concerned about safety at any time, call the troubleshooting off and clear the room.
You may or may not drop the room based on a gut feeling, but regrouping outside the room in (relative) safety is prudent.

If there isn't imminent danger, you may elect to bring in the local fire department before taking any drastic actions like an EPO or clean-agent release. (They may tell you to do so anyway: their mandate is to protect people, then property, but they're obviously the experts in dealing with fires, so you should do what they say!)

We've addressed this in comments, but it may as well get summarized in an answer too -- @DeerHunter, @Chris, @Sirex, and many others contributed to the discussion.
