Wow, what a painful release this was (is?).
In late August, Karl and I had installed our new Cisco load balancer and firewall. Unbeknownst to us, we were dropping connections. A few committers noted that CVS connections were causing broken builds, and we had early reports from our mirrors that their RSYNC connections were being terminated. We didn't pay too much attention to the RSYNC issue in favour of resolving CVS, since RSYNC is one of those robust protocols that is essentially bomb-proof.
Fast forward to Friday, Sept. 25. I did a real quick mirror check and everything checked out. We're good to go.
I mean, this is just a point release, and I've done millions of these. Business as usual, right?
At around 3:00pm ET on Friday, I was getting reports that the ZIP files were missing on most of the mirrors, despite the fact that they were considered in sync. Uh oh. Since Karl had found (and fixed) some short timeouts that may have caused the dropped connections, I went on to assume that mirrors were simply not yet fully up-to-date with Galileo SR1, and that they would be in sync sometime during the weekend.
As it turns out, since late August, our mirrors would begin syncing, but would never finish. They were all badly out of date, but still considered in sync because they checked in regularly. So they spent most of the weekend simply catching up, without actually getting the new SR1/3.5.1 files.
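To make the failure mode concrete, here is a minimal sketch (not our actual mirror-checking code) of the difference between "the mirror checked in recently" and "the mirror actually has the files". Everything in it is an assumption for illustration: the function names, the 24-hour freshness window, and the file paths are all hypothetical.

```python
from datetime import datetime, timedelta

def checked_in_recently(last_check_in, now, max_age=timedelta(hours=24)):
    """Freshness test based only on when the mirror last checked in."""
    return now - last_check_in <= max_age

def missing_files(required, mirror_listing):
    """Return the required files that are absent from the mirror's listing."""
    return sorted(set(required) - set(mirror_listing))

def is_really_in_sync(last_check_in, now, required, mirror_listing):
    """A mirror counts as in sync only if it is fresh AND complete."""
    return (checked_in_recently(last_check_in, now)
            and not missing_files(required, mirror_listing))

# Hypothetical example: a mirror that checked in two hours ago
# but whose sync never finished.
now = datetime(2009, 9, 28, 12, 0)
last_check_in = now - timedelta(hours=2)
required = ["release/sdk.zip", "release/content.jar", "release/artifacts.jar"]
listing = ["release/content.jar"]

fresh = checked_in_recently(last_check_in, now)
missing = missing_files(required, listing)
in_sync = is_really_in_sync(last_check_in, now, required, listing)
```

In this made-up case the mirror passes the freshness test but is still missing two of the three required files, which is roughly the situation we were in. The same completeness check would apply to any fallback host before routing downloads to it.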
On Monday, the above became painfully apparent when we were caught serving p2 updates for most of the planet from a single 100 megabit internet connection. At this point, mirrors were having a difficult time pulling updates from us. I then brilliantly re-routed most of our downloads to our Amazon AWS account, after making sure it was in sync.
Wrong again, hero.
My uploads to AWS were also not completing. Apparently, when you update Eclipse, there are content/artifact jar files everywhere in our tree that need to be fetched. Some of those were not on AWS yet, causing the updates to fail.
"Epic Fail." What have you learned?
When you think it's business as usual, you're probably wrong. Plenty of lessons learned here.
What happens now?
Most of our mirrors are now in sync, and so is our Amazon AWS. p2 probably got burned by many broken mirrors and now only trusts the home site. It will eventually learn to trust its mirrors again. Until then, updates may be a bit slow, but they should succeed.