Friday, February 17, 2012

Wednesday's outage explained

Last Wednesday, just after 9:30am Eastern time, my SSH console became unresponsive. Both our primary AND secondary NFS servers had stopped responding, and as a result most of the site was off the air. Since the failed servers are physically elsewhere, we couldn't simply walk over to the console to see what had happened.

Usually, when one server ceases to respond, the problem is with the server. When two servers on the same network segment cease to respond at the same time, it's anything but the server. But Matt and I took no chances and split up: I investigated the network side and the possibility of a kernel DoS/exploit, and he hopped in his car to go see what was happening on the server side. Fortunately, the servers are only 10 minutes away.

As it turns out, the Linux kernel crashed on both servers, within minutes of each other. Here's a sample of what we saw in the logs:

Feb 15 09:34:58 kernel: [18446743997.844366] WARNING: at [snip]/kernel/sched.c:3878 find_busiest_group+0xc79/0xce0()
Feb 15 09:34:58 kernel: [18446743997.844370] Hardware name: X8DT6
Feb 15 09:34:58 kernel: [18446743997.844417] Pid: 51, comm: events/0 Not tainted

Both servers are physically identical and were brought online about the same time, so this whole thing smells like something I've heard of before. To make me feel even better, the kernel bug that closely matches what we experienced is still open today.

After restarting both servers, we discovered that our rather large OpenLDAP server's database had some data corruption, and some specific operations caused it to segfault. Those numerous LDAP crashes meant it was difficult for anyone to get anything done on Wednesday.

It's all fun :) Any bets on when this will happen again?

Monday, February 06, 2012

EclipseCon location change -- poll results

Last week I issued a webmaster "whacky poll", asking for your thoughts on the location change for EclipseCon this year. The results are in!

97 /doesnt-matter-where-it-is-as-long-as-there-is-enough-beer

Earning the #1 spot on the poll, I think it's clear what the priorities are...

93 /doesnt-matter-where-it-is-as-long-as-the-awesome-p2-guys-are-there

That wasn't actually an option on the poll. However, it seems to have gone viral. Or maybe someone was stuffing the ballot box :)

50 /no-matter-where-eclipsecon-is-webmasters-will-still-buy-us-beer-right?

I think that is one of those "life certainties"...

38 /yay-less-time-on-an-airplane
27 /oh-no-more-time-on-an-airplane

Looks like more people will be spending less time on a plane. In other news, 27 people will be coming from California :)

38 /the-weather-better-be-warm-and-the-beer-cold
29 /if-its-not-in-california-im-not-going

You know, for the last few EclipseCons in California, the weather wasn't all that warm. I hope Washington treats us right.

In the "why-not-do-eclipsecon-in-(insert-tropical-exotic-location-here)" category:
  • Melbourne, Australia
  • Ibiza
  • Fiji
I'm sorry, but Toronto doesn't qualify as either tropical or exotic...

Others have also improvised their own entries... Such as this go-green one:

1 /yay-less-carbon-dioxide-maybe-we-can-survive-on-this-planet

1 /hi-denis-i-will-ask-my-obeos-colleagues-to-buy-you-some-beers

You can always count on the Obeo guys to give back to the community.

A few people wrote in to say they couldn't attend this year, but this entry stood out:

1 /alblue-cant-make-it-again-will-cover-remotely

That's a shame -- we'll be missing the unique ties again this year.

That wraps up my whacky poll for EclipseCon! I'm looking forward to seeing everyone there.

Thursday, February 02, 2012

EclipseCon Poll: What do you think of the location change?