Tuesday, August 07, 2007

What the heck happened this weekend?

I can't post this in the most timely manner because, in its infinite wisdom, Blogger has determined that eclipsewebmaster is a spam blog. Uh, what? Apparently too many links to one site. Gosh... it IS the eclipse.org Webmasters' blog. So I have put in a request to get this rectified, and this will be posted as soon as they set themselves straight.

So what the heck happened this weekend? Some of you noticed that many services were not working properly over the weekend at certain points. This included: Bugzilla, the MyFoundation Portal, Eclipsepedia, and other services. Luckily CVS was not affected. So what happened?

We have two main back-end servers, quad Power5 boxes kindly donated by IBM in 2004. One of them hit a hardware/kernel interaction problem early last week that took down the site for a short time while we failed over to the secondary box. That machine was then shouldering the load for both servers. Everything was working fine until Friday night, when something locked up the database. Because we run everything on these big boxes, we have one DB server for a lot of critical services. That's good because it's easy to build redundancy, but bad when the redundancy is out of commission. That's the situation we faced heading into the weekend. As soon as Matt brought the server back up, it was slammed with a huge number of queued jobs, which brought it to a near standstill within 30 minutes. By early Saturday morning Pacific time, Matt and I had gotten the DB back on its feet and everything was humming along again. But we were still running on just the one server.

Early Monday morning, one of the projects launched a very large number of queries in a very short amount of time, each running in its own DB connection. Since we were running on a single server, this one job ate up the total number of available connections, and again everything came to a standstill. Denis took care of it and got the servers back online. But by this point we had had two fairly extended outages.
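For the curious, the failure mode here is generic: if every query opens its own connection, a burst of queries can hit the server's hard connection limit and lock everyone else out. One standard defense is a bounded connection pool, where a runaway job waits in line instead of starving the server. Here's a minimal sketch in Python using SQLite — purely illustrative, not what eclipse.org actually runs, and the `ConnectionPool` class is a hypothetical helper, not a real library API:

```python
import queue
import sqlite3
import threading

class ConnectionPool:
    """Hypothetical bounded pool: at most max_connections DB handles exist."""

    def __init__(self, db_path, max_connections=5):
        self._pool = queue.Queue(maxsize=max_connections)
        for _ in range(max_connections):
            # check_same_thread=False lets a connection be reused across threads
            self._pool.put(sqlite3.connect(db_path, check_same_thread=False))

    def query(self, sql, params=()):
        conn = self._pool.get()  # blocks here if all connections are busy
        try:
            return conn.execute(sql, params).fetchall()
        finally:
            self._pool.put(conn)  # always hand the connection back

# Ten "simultaneous" jobs are funneled through only two connections,
# so the burst queues up instead of exhausting the server's limit.
pool = ConnectionPool(":memory:", max_connections=2)
results = []
threads = [threading.Thread(target=lambda: results.append(pool.query("SELECT 1")))
           for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The key design point is that the pool's queue, not the database server, absorbs the burst: excess callers block in `query()` until a connection is returned, so the server never sees more than `max_connections` open handles at once.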

So that's where we stand. We're working hard to recover the downed server and bring it back online as a secondary, and we're going to move some of the services around to make them more reliable. The root of it all is that we have reached the limit of the current architecture. Denis has us developing a plan to improve our hardware situation in next year's budget cycle, which will let us change the architecture. We're doing our very best. In the meantime, if anyone wants to gift us a production-class SCSI RAID array and disks, or say a SAN array and cards, that might help us out now... ;)


Anonymous AlBlue said...

You guys lead thankless lives toiling over the architecture ... the fact that it stays up for such long periods and that this is an unusual event (combined with hardware issues) goes to show that the job you guys do is exemplary!

Keep up the good work, and hope the recovery goes well.

4:00 AM  
