Wednesday, September 30, 2009

Wow, what a painful release this was (is?)

If you were able to upgrade to Galileo SR1/Eclipse 3.5.1 in a timely manner, you were probably just lucky.  Even today, 5 full days after the release, our servers are still crawling.

What happened?

In late August, Karl and I had installed our new Cisco load balancer and firewall.  Unbeknownst to us, we were dropping connections.  A few committers noted that dropped CVS connections were breaking their builds, and we had early reports from our mirrors that their RSYNC connections were being terminated.  We didn't pay too much attention to the RSYNC issue in favour of resolving CVS, since RSYNC is one of those robust protocols that is essentially bomb-proof.

Mistake 1.

Fast forward to Friday, Sept. 25. I did a real quick mirror check and everything checked out.  We're good to go.

Mistake 2.

I mean, this is just a point release, and I've done millions of these. Business as usual, right?

Mistake 3.

At around 3:00pm ET on Friday, I was getting reports that the ZIP files were missing on most of the mirrors, despite the fact that they were considered in sync.  Uh oh.  Since Karl had found (and fixed) some short timeouts that may have caused the dropped connections, I went on to assume that mirrors were simply not yet fully up-to-date with Galileo SR1, and that they would be in sync sometime during the weekend.

Mistake 4.

As it turns out, since late August our mirrors would begin syncing, but would never finish.  They were all badly out of date, yet still considered in sync because they checked in regularly. So they spent most of the weekend simply catching up, without actually getting the new SR1/3.5.1 files.
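For what it's worth, here's a minimal sketch of the kind of check that would have caught this: instead of trusting a mirror's check-in, verify that a file unique to the new release actually exists on it.  The mirrors.txt file and the marker path below are hypothetical stand-ins, not our real monitoring setup.

# mirrors.txt holds one mirror base URL per line (hypothetical)
MARKER="technology/epp/downloads/release/galileo/SR1/eclipse-java-galileo-SR1-win32.zip"
while read MIRROR; do
  # -s silent, -f fail on HTTP errors, -I send a HEAD request only
  if curl -sfI "$MIRROR/$MARKER" > /dev/null; then
    echo "OK      $MIRROR"
  else
    echo "MISSING $MIRROR"
  fi
done < mirrors.txt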

On Monday, the above became painfully apparent when we were caught serving p2 updates for most of the planet from a single 100 megabit internet connection. At this point, mirrors were having a difficult time pulling updates from us.  I then brilliantly re-routed most of our downloads to our Amazon AWS account, after making sure it was in sync.

Wrong again, hero.

My uploads to AWS were also not completing. Apparently, when you update Eclipse, p2 fetches content/artifact jar files from all over our download tree. Some of those were not on AWS yet, causing the updates to fail.
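A rough sketch of the sanity check I should have run first: walk the download tree for p2 metadata and ask the mirror for each file.  The local root and the AWS base URL below are made up for illustration.

DOWNLOAD_ROOT=/home/data/httpd/download.eclipse.org   # hypothetical local root
AWS_BASE=http://aws-mirror.example.com                # hypothetical mirror base URL
cd "$DOWNLOAD_ROOT" || exit 1
# find every p2 metadata file, then HEAD it on the mirror
find . -type f \( -name 'content.jar' -o -name 'artifacts.jar' \) |
while read F; do
  P=${F#./}   # strip the leading "./" so the path maps onto the URL
  curl -sfI "$AWS_BASE/$P" > /dev/null || echo "missing on AWS: $P"
done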

"Epic Fail." What have you learned?

When you think it's business as usual, you're probably wrong.  Plenty of lessons learned here.

What happens now?

Most of our mirrors are now in sync, and so is our Amazon AWS.  p2 probably got burned by many broken mirrors and now only trusts the home site.  It will eventually learn to trust its mirrors again. Until then, updates may be a bit slow, but they should succeed.

Friday, September 25, 2009

Galileo SR1 is here!

Galileo SR1, based on Eclipse 3.5.1, is here!  You can fetch your favourite goodies from the usual URL:

http://eclipse.org/downloads/

Thursday, September 24, 2009

Galileo SR1 - available early for Friends of Eclipse

Galileo SR1 is available a day early for Friends of Eclipse.  Here is your link:

http://friends.eclipse.org/galileo_sr1.html

Of course, it's never a bad time to become a Friend:

http://www.eclipse.org/donate/

Wednesday, September 02, 2009

What is my NFS server doing?

If you're running an NFS daemon (nfsd), at some point you may have wondered what it was doing at any given moment.  If it's running in kernel space, tools like lsof and strace don't work, so you're left guessing.

After much Googling and some inspection of the kernel source code, I discovered some debugging values that can be poked into /proc/sys/sunrpc/nfsd_debug.  The most useful was 32, which I used like this:
echo 32 > /proc/sys/sunrpc/nfsd_debug; tail -f /var/log/messages | grep lookup

Essentially, this will give you an idea as to what files are being served up by nfsd.  Be careful, though: on a busy NFS server, this will spew lots of output to /var/log/messages.

After stopping the above command with CTRL+C, don't forget to turn off nfsd_debug:
echo 0 > /proc/sys/sunrpc/nfsd_debug
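
If you'd rather have a ranked summary than watch the log scroll by, something along these lines does the trick.  Note that the awk field is a guess that assumes the looked-up name is the last thing on each log line; check your own log format first.

# capture a 30-second burst of lookups, then count the hottest names
echo 32 > /proc/sys/sunrpc/nfsd_debug
sleep 30
echo 0 > /proc/sys/sunrpc/nfsd_debug
grep -i lookup /var/log/messages | awk '{print $NF}' | sort | uniq -c | sort -rn | head -20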

With this trick I was able to find some nasties that were hurting our NFS performance.

Fun times at work

What a fun quarter this has been so far.  It started with the new Forums site, then Matt and I performed some much-needed hardware maintenance, Karl and I swapped all our Cisco devices for new ones, and I upgraded Bugzilla last weekend.  In the mix, I've been hunting down MySQL and NFS problems and looking for all kinds of optimizations to try to restore some snappiness to our site.

Both Bugzilla and the Forums still need a bit of work, and after that I'll tackle another big toy: Git.

Fun times!