Monday, April 12, 2010

Setting up a download server... How much RAM do I need?

NOTE: This is an old post that was picked up by PlanetEclipse as a result of moving my blog.


If you've been following along, you know that I received a bunch of hardware to upgrade Eclipse.org. 'Keep it in RAM' is my thinking with all this. Currently, our servers spend a lot of time waiting for disk resources, so the benefit of serving requests from RAM is twofold: cached files are served quickly, and the disks are left free to respond faster to the requests that do have to hit them.

Right now I'm in the process of setting up a high-capacity download server to replace download.eclipse.org, so the question is -- how much RAM do I need?

I started by examining the combined Apache logs of download.eclipse.org for a single, arbitrarily chosen day.
zegrep "GET .* 200 " download.eclipse.org/access_log.3.gz | awk '{print $7}' > filelist
filelist contains 2,439,051 successful "GET" requests

sort filelist | uniq -c | sort -nr > filelist-sorted
filelist-sorted contains 73,184 entries
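
Those two counts already put a ceiling on how well a cache could do. Here is a minimal Python sketch of that arithmetic. It assumes an ideal cache that reads each file from disk once and never evicts it, so the result is a best case rather than a prediction; the function name is made up for illustration.

```python
def max_hit_ratio(total_requests: int, unique_files: int) -> float:
    """Best-case hit ratio: each distinct file misses once, then always hits."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 1.0 - unique_files / total_requests


if __name__ == "__main__":
    # Counts taken from filelist and filelist-sorted above.
    ratio = max_hit_ratio(2_439_051, 73_184)
    print(f"best-case hit ratio: {ratio:.1%}")
```

With the numbers above this comes out at roughly 97%, provided the cache is big enough to hold every requested file.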

So each day our server reads only about 73,000 distinct files, but serves them 2.4 million times. There is a huge potential for cache hits right there. With a small Perl script, I gathered the size (on disk) of those 73,184 files. Total: 25G.



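#!/usr/bin/perl
use strict;
use warnings;

# Print "size count path" for every entry in filelist-sorted.
open(my $fh, '<', 'filelist-sorted') or die "Cannot open filelist-sorted: $!";
while (my $line = <$fh>) {
        chomp $line;
        # uniq -c left-pads the count, so split on runs of whitespace
        my ($c, $f) = split(' ', $line, 2);
        my $size = -s "downloads$f";
        $size = 0 unless defined $size;   # file is no longer on disk
        print "$size $c $f\n";
}
close($fh);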

So if I get 24G of RAM, I'm sure that most of the file requests will come from cache.  Actually, if I load up the numbers in a spreadsheet, I see a wicked long-tail distribution.  In fact, 134 files are fetched at least once each minute and account for 43% of all requests.  If you consider the RAM requirements of the OS and Apache, 24G would be great -- for today's needs.  What about next year?

Considering it's cheaper to buy RAM while it's still a popular commodity part, I put 64G of RAM in the new download.eclipse.org. It should be more than sufficient to hold an entire week's worth of download files, file attributes and such while keeping disk requests to a minimum. It will also leave plenty of RAM for the OS and Apache, even when things heat up in June.
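
The long-tail numbers mentioned earlier don't strictly need a spreadsheet. Here is a rough Python sketch, assuming input lines in the "size count path" form that the Perl script prints. The names `parse_line` and `cache_for_share` are made up for illustration. It ranks files purely by request count, which only approximates what a real page cache would keep in memory.

```python
from typing import Iterable, List, Optional, Tuple

Row = Tuple[int, int, str]  # (size in bytes, request count, path)


def parse_line(line: str) -> Optional[Row]:
    """Parse one "size count path" line; return None if it is malformed."""
    parts = line.strip().split(None, 2)
    if len(parts) != 3:
        return None  # e.g. the size is missing because the file is gone
    try:
        return int(parts[0]), int(parts[1]), parts[2]
    except ValueError:
        return None


def cache_for_share(rows: Iterable[Row], target: float) -> Tuple[int, int]:
    """Return (file count, total bytes) of the most-requested files that
    together account for at least `target` (0..1) of all requests."""
    ranked: List[Row] = sorted(rows, key=lambda r: r[1], reverse=True)
    total = sum(count for _, count, _ in ranked)
    covered = files = cached_bytes = 0
    for size, count, _ in ranked:
        if covered >= target * total:
            break
        covered += count
        cached_bytes += size
        files += 1
    return files, cached_bytes
```

Feeding it the parsed lines with a target of 0.43 would show how many files, and how many bytes of RAM, cover 43% of requests. Raising the target toward 1.0 shows how quickly the tail eats memory.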

We're also moving to the Apache worker MPM for download.eclipse.org.  It is multi-threaded, so with high client counts, it consumes much less memory than the prefork model.  PHP files (which are only a very small fraction of the hits) will be served over FastCGI.

So there you have it.  Your downloads won't necessarily be faster since we are limited by bandwidth, but they should begin faster.  The new setup will also free disk resources for those files that cannot always be cached, such as CVS, SVN and Git.  Win-win!
