Wednesday, June 08, 2011

Anatomy of an outage - part 1

It happens to Facebook. It happens to Hotmail. And yes, it happens to Eclipse. Last week, Eclipse.org suffered two distinct outages. Today I'll discuss the first one "briefly".

From a bandwidth point-of-view, this is what a normal 24-hour period looks like.

This is what we saw on the morning of May 31: I had just walked into the office when I discovered we were completely offline. Since the Foundation's Ottawa office is wired directly into the switching gear, I was surprised to discover I couldn't even talk to our switches. So I hopped in my car and drove to the data center, which is about 10 minutes away.

Turns out a few servers were powered off, as was our main switch and firewall -- the circuit breaker had tripped. When the technician reset the circuit, sparks spewed out of a power supply in our DS4 RAID array. Not good.

We pulled the faulty component, reset the breaker and restored power to the switch and firewall. "Problem solved," I thought. Not so fast.

For reasons unknown, our primary NFS server (which hosts shared files for CVS, Git, SVN, and many of our websites) was frozen. I found that strange, since it doesn't share a power circuit with the failed equipment. Heck, it isn't even on the same voltage rating. This specific server doesn't normally output video, relying instead on a Hardware Management Console (HMC) to communicate with the operator.

As luck would have it, the HMC was one of the servers connected to the downed power circuit, and filesystem errors were preventing it from coming back online. So I had no insight as to what my faulty NFS server was doing, or why it was frozen.

After a bit of fiddling and trying to restart it, I decided to abandon the primary NFS server, and notified fellow webmaster Matt to begin the failover process to the secondary server. That is when services began recovering, about 1 hour and 45 minutes later.
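For the curious, a manual failover like this typically means moving the service address and exports over to the standby machine. Here is a minimal sketch of what such a script might look like; the IP address, interface, and service names are illustrative assumptions, not our actual configuration, and the script defaults to a dry run that only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical NFS failover sketch, assuming a floating service IP and
# identical export definitions already present on the secondary server.
# Set DRY_RUN=0 to actually execute the commands.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" -eq 1 ]; then
        echo "WOULD RUN: $*"
    else
        "$@"
    fi
}

# 1. Bring up the floating service IP on the secondary server
#    (10.0.0.50 and eth0 are made-up values).
run ip addr add 10.0.0.50/24 dev eth0

# 2. Re-export all filesystems listed in /etc/exports.
run exportfs -ra

# 3. Restart the NFS service so clients can reconnect and reclaim locks.
run service nfs restart
```

Clients with hard NFS mounts will generally block and then resume once the new server answers on the old address; stale file handles on some clients may still require a remount.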

To this day, we still haven't figured out what is wrong with our primary NFS server, although Matt has fixed the HMC. And the DS4 storage box is still operating on only one power supply (although another is on order and is expected to arrive soon).

We've decided to begin the process of acquiring new NFS server hardware. The 'old' servers are big, complex machines that have been in service 24/7/365 for almost 7 years. They have served us well, but they are showing signs of age, with memory failures, disk failures and backup battery failures manifesting themselves more frequently.

Stay tuned for an explanation of the second complete outage, which came upon us just days later...
