Wednesday, September 30, 2009

Wow, what a painful release this was (is?)

If you were able to upgrade to Galileo SR1/Eclipse 3.5.1 in a timely manner, you were probably just lucky.  Even today, 5 full days after the release, our servers are still crawling.

What happened?

In late August, Karl and I had installed our new Cisco load balancer and firewall.  Unbeknownst to us, we were dropping connections.  A few committers noted that CVS connections were causing broken builds, and we had early reports from our mirrors that their RSYNC connections were being terminated.  We didn't pay too much attention to the RSYNC issue in favour of resolving CVS, since RSYNC is one of those robust protocols that is essentially bomb-proof.

Mistake 1.

Fast forward to Friday, Sept. 25. I did a real quick mirror check and everything checked out.  We're good to go.

Mistake 2.

I mean, this is just a point release, and I've done millions of these. Business as usual, right?

Mistake 3.

At around 3:00pm ET on Friday, I was getting reports that the ZIP files were missing on most of the mirrors, despite the fact that they were considered in sync.  Uh oh.  Since Karl had found (and fixed) some short timeouts that may have caused the dropped connections, I went on to assume that mirrors were simply not yet fully up-to-date with Galileo SR1, and that they would be in sync sometime during the weekend.

Mistake 4.

As it turns out, since late August, our mirrors would begin syncing, but would never finish.  They were all badly out of date, but were still considered in sync because they checked in regularly. So they spent most of the weekend simply catching up, without actually getting the new SR1/3.5.1 files.
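To make the failure mode concrete, here is a minimal sketch of the difference between "the mirror checked in recently" and "the mirror actually finished a sync after the release". This is not our real mirror-tracking code; the `MirrorStatus` fields, the function names and the one-day threshold are all hypothetical, made up for illustration.

```python
from dataclasses import dataclass

ONE_DAY = 86400  # seconds; hypothetical freshness threshold


@dataclass
class MirrorStatus:
    name: str
    last_checkin: float         # epoch seconds: last time the mirror contacted us
    last_completed_sync: float  # epoch seconds: last sync that ran to completion


def naive_in_sync(mirror: MirrorStatus, now: float) -> bool:
    """The flawed check: a recent check-in is treated as proof of being in sync."""
    return now - mirror.last_checkin <= ONE_DAY


def really_in_sync(mirror: MirrorStatus, release_time: float, now: float) -> bool:
    """Stricter check: the mirror must also have FINISHED a sync after the release."""
    checked_in_recently = now - mirror.last_checkin <= ONE_DAY
    has_release = mirror.last_completed_sync >= release_time
    return checked_in_recently and has_release


now = 1_000_000.0
release_time = now - 3600  # release went out an hour ago

# Checks in every few minutes, but has not completed a sync in 30 days.
stalled = MirrorStatus("stalled", now - 60, release_time - 30 * ONE_DAY)
# Checks in regularly and finished a sync ten minutes after the release.
healthy = MirrorStatus("healthy", now - 60, release_time + 600)

print(naive_in_sync(stalled, now))                 # True: looks fine, is not
print(really_in_sync(stalled, release_time, now))  # False
print(really_in_sync(healthy, release_time, now))  # True
```

The point of the second check is that a mirror whose connection keeps getting dropped mid-transfer still shows up on schedule, so a check-in alone tells you nothing about what it actually holds.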

On Monday, the above became painfully apparent when we were caught serving p2 updates for most of the planet from a single 100 megabit internet connection. At this point, mirrors were having a difficult time pulling updates from us.  I then brilliantly re-routed most of our downloads to our Amazon AWS account, after making sure it was in sync.

Wrong again, hero.

My uploads to AWS were also not completing. Apparently, when you update Eclipse, there are content/artifact jar files everywhere in our tree that need to be fetched. Some of those were not on AWS yet, causing the updates to fail.
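A check along the following lines would have caught this before I flipped the switch. It is only a sketch under assumptions: it takes the p2 metadata files to be named `content.jar` and `artifacts.jar`, it compares plain lists of relative paths instead of talking to AWS, and the function name and the example paths are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed file names for the p2 metadata scattered through the download tree.
P2_METADATA_NAMES = {"content.jar", "artifacts.jar"}


def missing_p2_metadata(local_paths, uploaded_paths):
    """Return the p2 metadata files present locally but absent from the upload."""
    uploaded = set(uploaded_paths)
    return sorted(
        path
        for path in local_paths
        if PurePosixPath(path).name in P2_METADATA_NAMES and path not in uploaded
    )


# Hypothetical paths, for illustration only.
local = [
    "releases/galileo/content.jar",
    "releases/galileo/artifacts.jar",
    "eclipse/updates/3.5/content.jar",
    "eclipse/downloads/eclipse-SDK.zip",
]
uploaded = [
    "releases/galileo/content.jar",
    "eclipse/downloads/eclipse-SDK.zip",
]

for path in missing_p2_metadata(local, uploaded):
    print("not uploaded yet:", path)
```

An empty result is the only acceptable answer before re-routing update traffic, since a single missing metadata file is enough to make an update fail.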

"Epic Fail." What have you learned?

When you think it's business as usual, you're probably wrong.  Plenty of lessons learned here.

What happens now?

Most of our mirrors are now in sync, and so is our Amazon AWS.  p2 probably got burned by many broken mirrors and now only trusts the home site.  It will eventually learn to trust its mirrors again. Until then, updates may be a bit slow, but they should succeed.

5 Comments:

Anonymous Chris Aniszczyk said...

Thanks for being on top of this!

10:26 AM  
Anonymous Denis Roy said...

Thanks, but I wish I had REALLY been on top of it last Friday!

11:27 AM  
Anonymous Ian Bull said...

Thanks Guys! Like always, when things go well nobody notices... after all this was a point release and you've done millions of these ;-). But when things go wrong for you -- all hell breaks loose. You guys do an excellent job! Good work tracking this down... I'm looking forward to March when we can all sit in the Hyatt lobby, have a few beers, and a few good laughs about this.

12:05 PM  
Anonymous Nick Boldt said...

Yet another year where it's assumed that "it's just a maintenance release, what could go wrong?"

*sigh*

For the third year in a row, let me sing the same song: maintenance releases are just like GA releases -- EVERYONE WANTS THEM IMMEDIATELY. Thus, they must be treated with the same forethought, planning, panic and apprehension as in June. Capisce?

12:38 PM  
Anonymous Denis Roy said...

Yeah... Feature request? After 5 years I thought I knew the lingo...

5:04 PM  
