Wednesday, May 30, 2007

Europa: 30 days and counting

I've been so busy lately with various things that I haven't noticed the time go by. Quick glances at the server monitors show heavy CVS activity, and Bugzilla is taking a heavier beating than usual. That smells like an ... an upcoming... RELEASE. Or worse, 21 releases.

Unlike last year's Callisto, which (only) included 10 projects, this year's Europa means 21 projects will be releasing new software at the same time. New releases from heavyweight projects like the CDT, WebTools and Eclipse itself are major events here in Eclipseland, so you can imagine the chaos of releasing 21 projects at the same time as our site struggles to keep up with hundreds of thousands of download requests.

This is like having Apache release a new httpd, Tomcat, Lucene, and a bunch of other projects at the same time. Or Mozilla releasing new major versions of Firefox and Thunderbird at the same time.

Luckily for us webmasters, we have last year's Callisto experience to lead us through Europa this year. Although more bandwidth and tweaked TCP stacks will be important for us this year, the key players in ensuring a smooth distribution of files will be our mirror sites -- without them, getting your bits would be infinitely more difficult (or expensive).

30 days to Europa - let the games begin.

Thursday, May 10, 2007

DoS attacks from Google? Look again

Lately an interesting type of DoS (denial of service) attack has been hitting the various Eclipse sites, and although I'm not sure if it's widespread or just an Eclipse thing, it could affect Google as well.

Here's what happens: load on the servers and databases slowly increases as Apache serves the home page of a site (and only the home page -- no images, CSS or other related files) to the same IP address at a very rapid rate (several times per second). As the new requests come in faster than the served connections are closed, within minutes the server starts to run out of resources. The catch is, if I look at the logs, I see hundreds, no -- thousands of lines like these:

(ip hidden) - - [10/May/2007:06:20:42 -0400] "GET / HTTP/1.0" 301 232 "http://live.eclipse.org" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
(ip hidden) - - [10/May/2007:06:20:42 -0400] "GET / HTTP/1.0" 301 232 "http://live.eclipse.org" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Googlebot? Sheesh! You'd think Google could write a smarter bot! Just as I ready myself to write a nasty e-mail to Google, I notice that the Googlebot's IP address doesn't really look like a Google IP address (you get to know these after a while). After some digging around, I discovered that the offending IP address is registered to some ISP in Connecticut.
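For anyone who hasn't memorized Google's address ranges, a reverse DNS check followed by a forward lookup is a common way to test a "Googlebot" claim. This is not what I did (I looked up who the address was registered to); it is a minimal sketch that assumes genuine crawler hostnames end in googlebot.com or google.com. The function name and the injectable resolver arguments are made up for illustration, and the forward lookup used here only handles IPv4.

```python
import socket

# Assumption: genuine Google crawler hostnames end in one of these suffixes.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_real_googlebot(ip, reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a Google hostname that
    resolves back to the same address (hypothetical helper)."""
    try:
        # gethostbyaddr returns (hostname, aliases, addresses)
        host = reverse(ip)[0]
    except OSError:
        return False  # no PTR record, or the lookup failed
    if not host.endswith(GOOGLE_SUFFIXES):
        return False
    try:
        # gethostbyname_ex returns (hostname, aliases, IPv4 addresses)
        addresses = forward(host)[2]
    except OSError:
        return False
    # The forward lookup must include the original address; otherwise
    # the PTR record could simply be forged by whoever owns the IP.
    return ip in addresses
```

The resolver functions are parameters only so that the check can be exercised without touching the network; in normal use you would call `is_real_googlebot(ip)` and let the defaults do the lookups.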

I happened to catch the first two attacks red-handed on Tuesday, and I was able to block the culprit IP addresses on our firewall before any significant interruption of service occurred. Yesterday I hacked some DoS protection into one of our monitoring scripts, just in case this happened again. Lo and behold, this morning there were two Attack warnings in the webmaster box - both from these fake Googlebots, both fetching a homepage dozens of times per second. Both got blocked on our firewall.
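To give an idea of what such a check can look like, here is a minimal sketch that scans access log lines for one address fetching the home page many times within a single second. It is not the actual monitoring script: the function name, the regular expression and the threshold of 5 hits per second are assumptions for illustration, and it expects a real IP address at the start of each line (the samples above have theirs hidden).

```python
import re
from collections import Counter

# Matches the start of an Apache combined-format log line:
# client address, two ignored fields, [timestamp], "METHOD path protocol"
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*"')


def find_flooders(lines, path="/", max_hits_per_second=5):
    """Return {ip: peak hits in one second} for every address that
    requested `path` more than `max_hits_per_second` times in a second."""
    hits = Counter()
    for line in lines:
        match = LOG_RE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, stamp, method, url = match.groups()
        if method != "GET" or url != path:
            continue
        # The log timestamp has one-second resolution, so (ip, stamp)
        # identifies "this address during this second".
        hits[(ip, stamp)] += 1

    peaks = {}
    for (ip, _stamp), count in hits.items():
        if count > max_hits_per_second:
            peaks[ip] = max(peaks.get(ip, 0), count)
    return peaks
```

Whatever comes back from a function like this would then be handed to the part that raises the warning or blocks the address; that part depends entirely on the firewall in use, so it is left out.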

What a waste of resources. Don't do stuff like this. You're just dumb if you do. And you'll lose all your hair.

Tuesday, May 08, 2007

Bugzilla: avoiding stale searches

We have a master-slave MySQL replication setup here at Eclipse for redundancy, and Bugzilla is configured to use the slave DB for those SELECTs that are appropriate for the slave to handle. This helps performance greatly, especially when small queries need to wait for tables locked by large queries. Our Bugzilla database isn't small, and it's open to the world, so it takes a huge beating - people issue the darndest queries, and lots of them - so some queries can take minutes to run, causing the slave's data to lag behind the master's.

When the slave lags, weird stuff happens to Bugzilla. The most common complaint comes when a user runs a named search and sees a bug they recently closed still displayed in an Open state. Confusing, frustrating, and avoidable.

I wrote a host of system monitoring scripts for the Eclipse servers, and one of those scripts is a MySQL monitor. It reports usage metrics for a nifty web page we use, and it also kills queries that run for too long. I recently hacked in functionality that updates the Bugzilla parameters on the fly so that Bugzilla uses the master DB exclusively whenever the slave DB lags by more than 120 seconds. When the slave catches up, the Bugzilla parameters are changed back so that it uses the slave again for maximum performance.
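The switching logic described above boils down to a small state machine. The sketch below is not the actual monitor script, just an illustration under a few assumptions: the class name and the `catch_up_seconds` parameter are made up, "caught up" is taken to mean a lag of zero unless told otherwise, and an unknown lag (for example, replication stopped) is treated as lagged. The lag value itself would have to come from MySQL, for instance the `Seconds_Behind_Master` column of `SHOW SLAVE STATUS`.

```python
class ReadTargetSwitcher:
    """Decide whether read-only queries should go to the slave or the
    master, based on how far the slave lags (hypothetical helper)."""

    def __init__(self, max_lag_seconds=120, catch_up_seconds=0):
        self.max_lag_seconds = max_lag_seconds    # 120 s, as in the post
        self.catch_up_seconds = catch_up_seconds  # assumption: 0 = caught up
        self.target = "slave"

    def update(self, lag_seconds):
        """Feed in the latest lag reading (None if unknown).

        Returns the new target ("master" or "slave") when a switch is
        needed, or None when the current setting should stay as it is.
        """
        lagging = lag_seconds is None or lag_seconds > self.max_lag_seconds
        caught_up = (lag_seconds is not None
                     and lag_seconds <= self.catch_up_seconds)

        if self.target == "slave" and lagging:
            self.target = "master"
            return self.target
        if self.target == "master" and caught_up:
            self.target = "slave"
            return self.target
        return None  # no change, so no need to rewrite any parameters
```

Returning None when nothing changes means the caller only has to touch the Bugzilla parameters on an actual transition, and waiting for the slave to catch up completely (rather than merely dropping below 120 seconds) keeps it from flapping back and forth while the slave hovers around the threshold.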

Since I implemented this late last week, Bugzilla has been switched to the master (and back) at least a dozen times as a result of heavy load on the slave. Good performance + up-to-date queries = happy committers. I like that.

Gunnar "you must do the right thing" Wagenknecht suggested that I release these infrastructure scripts under the EPL, so I'm in the process of doing so. If they're useful for us, they might be useful for someone else too.