Tuesday, June 21, 2011

Indigo early access to Friends of Eclipse

An email was sent out to the Friends of Eclipse, announcing early availability of the Indigo bits. Less than 2 minutes later, here is the resulting spike in usage on the Friends of Eclipse Mirror:

If you're a Friend, you can access the download links right now. If not, consider becoming a Friend today as part of the Indigo 500.

Torrent files are also up, giving our p2p users early access.

Thursday, June 09, 2011

Anatomy of an outage - part II

Yesterday I posted about the first of two outages we experienced last week. Today I'll post about the outage we had in the early afternoon of June 2.

As in our previous outage, we suddenly lost our ability to talk to our primary switch -- a Cisco 2970 24-port Gigabit switch, in service since October 2004.

We feared the worst -- a tripped power circuit caused by a faulty power supply. Again? Not this time -- the switch was simply 'frozen' with an orange alert lamp. After cycling the power, we were back in business.

But why did it just freeze? Was it beginning to show signs of fatigue? Matt and I took no chances -- with enough available ports on our much newer 48-port Cisco 2960 (part of a hardware donation made by Cisco for EclipseCON 2009), we migrated all the connections off to the new switch.

Accurate graphs and documentation allowed us to migrate port settings, VLANs and QoS rules quickly to the 'new' switch.

We need cool screenshots of Eclipse

Have you looked at http://www.eclipse.org/screenshots/ lately?

Yes, it is old.

I was thinking that we could move that page to the Wiki (perhaps http://wiki.eclipse.org/Screenshots ?) so that everyone can submit cool-looking screenshots of Eclipse in action.

If you agree it's a good idea, I need your help to get started. If you create the wiki page and upload/post your Eclipse In Action screenshots, I'll set up redirects from the old page to the new page.

Wednesday, June 08, 2011

Anatomy of an outage - part I

It happens to Facebook. It happens to Hotmail. And yes, it happens to Eclipse. Last week Eclipse.org suffered two distinct outages. Today I'll discuss the first one "briefly".

From a bandwidth point-of-view, this is what a normal 24-hour period looks like.

This is what we saw on the morning of May 31:

I had just walked into the office when I discovered we were completely offline. Since the Foundation's Ottawa office is wired directly into the Eclipse.org switching gear, I was surprised to discover I couldn't even talk to our switches. So I hopped in my car and drove to the Data Center, which is about 10 minutes away.

Turns out a few servers were powered off, as was our main switch and firewall -- the circuit breaker had tripped. When the technician reset the circuit, sparks spewed out of a power supply in our DS4 RAID array. Not good.

We pulled the faulty component, reset the breaker and restored power to the switch and firewall. "Problem Solved" I thought. Not so fast.

For reasons unknown, our primary NFS server (which hosts shared files for CVS, Git, SVN, and many of our websites) was frozen. I found that strange, since it does not share a power circuit with the equipment that went down. Heck, the two aren't even on the same voltage. This specific server doesn't normally output video, relying instead on a Hardware Management Console (HMC) to communicate with the operator.

As luck would have it, the HMC was one of the servers connected to the downed power circuit, and filesystem errors were preventing it from coming back online. So I had no insight as to what my faulty NFS server was doing, or why it was frozen.

After a bit of waffling and trying to restart it, I decided to abandon the primary NFS server, and notified fellow webmaster Matt to begin the failover process to the secondary server. That is when services began recovering, about 1 hour and 45 minutes later.

To this day, we still haven't figured out what is wrong with our primary NFS server, although Matt has fixed the HMC. And the DS4 storage box is still operating on only one power supply (although another is on order and is expected to arrive soon).

We've decided to begin the process of acquiring new NFS server hardware. The 'old' servers are big, complex machines that have been in service 24/7/365 for almost 7 years. They have served us well, but they are showing signs of age, as memory failures, disk failures and backup battery failures are manifesting themselves more frequently.

Stay tuned for some explanations on the other complete outage, which came upon us just days later...