Friday, May 29, 2009

Hey Denis, what is that on your screen?

Oh, nothing of interest.

Monday, May 11, 2009

Optimizing SQL: maximizing query cache

Today's topic is getting the most out of MySQL's query cache.  The query cache isn't on by default, so if your servers aren't using it, you're missing out.

Let's look at how the query cache works.  Witness a query I've used before:
select translation_id from translations where string_id in
(select string_id from strings where file_id = 35) and language_id = 2;
126 rows in set (3.08 sec)

If I run the exact same query again, the performance improvement is dramatic:
126 rows in set (0.00 sec)

So the query cache maintains a list of queries and their result sets.  But add a single space to the query, or capitalize the 's' in Select, and to MySQL it's a whole new query.  Furthermore, if the data changes in any of the tables referenced by the query, the query and its result set are cleared from the cache, and the query must be fully processed again.
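Of course, none of this matters unless the cache is actually enabled and given memory. A my.cnf sketch (the values here are illustrative, not recommendations):

```ini
# my.cnf -- illustrative values only
[mysqld]
query_cache_type  = 1     # 1 = ON: cache every cacheable SELECT
query_cache_size  = 32M   # total memory reserved for cached result sets
query_cache_limit = 1M    # don't cache any single result set larger than this
```

You can confirm the settings with SHOW VARIABLES LIKE 'query_cache%'; and keep an eye on hit rates with SHOW STATUS LIKE 'Qcache%';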

Don't be fooled by functions!

The following query, which returns changes that occurred yesterday and earlier, references the ever-changing value of NOW(), and will never be cached:
select translation_id from translations where string_id in
(select string_id from strings where file_id = 35) and created_on < now();

Instead, use a static value if you can:
select translation_id from translations where string_id in
(select string_id from strings where file_id = 35) and created_on < "2009-05-11";

Avoid costly updates

If your data changes so often that it defeats the query cache, consider splitting your tables: a read-mostly lookup table that receives infrequent, batched updates, and a separate table that takes the frequent writes.
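One way to structure that split, sketched here with hypothetical table names (hits_live absorbs the constant writes; hits_lookup serves the cached SELECTs):

```sql
-- hits_live takes the constant stream of writes
insert into hits_live (page, hit_count) values ('download', 1);

-- ...later, in a periodic batch job, fold the writes into the
-- read-mostly table; only this batch invalidates its cached queries
insert into hits_lookup select * from hits_live;
truncate table hits_live;
```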

As an alternative, if you're using a master-slave setup, configure your application to send SELECT queries to the slave, and enable low_priority_updates. A poor man's way to keep data from changing on the slave (and so improve query cache hits) is to cycle STOP SLAVE and START SLAVE commands with sleep intervals in between, so that data isn't being written constantly.
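The stop/start cycling could be a tiny cron-driven script along these lines. This is a hypothetical sketch: the mysql client invocation is prefixed with echo so the commands are only printed, and the pause is shortened; remove the echo, add credentials, and lengthen the pause to run it for real.

```shell
# Poor man's replication pause: stop the slave, let the query cache
# serve reads undisturbed, then start the slave to catch up on writes.
MYSQL="echo mysql -e"   # stand-in: remove 'echo' (and add -u/-p) for real use
PAUSE=2                 # seconds to stay paused; closer to 300 in production

$MYSQL "STOP SLAVE;"    # replication halts; cached SELECTs stop being invalidated
sleep "$PAUSE"
$MYSQL "START SLAVE;"   # the slave catches up on the accumulated writes
```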

At this stage, perhaps an even better (although more complex) solution is to cache at the application layer and avoid touching the database altogether (memcached for PHP, Squid in front of Apache, and so on).

Of course, all of these solutions have drawbacks, so use them at your discretion.

Pre-cache large queries

The Dash project's Project Dashboards use a large MySQL database on a very small server.  Once a week, all kinds of information is gathered from CVS and SVN and all the tables are rebuilt.  While once-a-week updates are great for the query cache, the first hit to any given page can take a long time to run.

The inexpensive solution was to add a series of wget commands to the script that rebuilds the data, so that the queries behind the most frequently accessed pages are cached immediately after the build, and those pages render instantly for everyone else.
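The tail end of such a rebuild script might look something like this. The URLs are placeholders, and wget is prefixed with echo so the sketch only prints what it would fetch; remove the echo to perform the requests.

```shell
# After rebuilding the tables, warm the query cache by requesting the
# most-visited dashboard pages once; later visitors get cached results.
FETCH="echo wget -q -O /dev/null"   # stand-in: remove 'echo' to really fetch

# Placeholder URLs -- substitute the real dashboard pages.
for url in \
    "http://www.example.org/dash/commits.php" \
    "http://www.example.org/dash/active-projects.php"
do
    $FETCH "$url"
done
```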

Alternatively, wget's output pages could simply be saved as static HTML (as is the case for the Bugzilla homepage, with its top-10 queries) for even greater gains.

MySQL's query cache is a simple and effective way to squeeze performance out of your database. It is by no means the solution to all your problems, so use it accordingly.

Up next: simple text search

Friday, May 08, 2009

Optimizing SQL Part 2: the size of your queries

The size and complexity of a SQL query can greatly influence its execution time. MySQL has supported subselects since version 4.1, but I often see SQL novices overuse them, with disastrous results.

For instance, consider these queries I use in Babel.  For accurate results, I ran all the queries once, then cleared the query cache, then ran them again.
select string_id from strings where file_id = 35;
83 rows in set (0.00 sec)

select translation_id from translations where string_id in 
(select string_id from strings where file_id = 35) and language_id = 2;
126 rows in set (8.18 sec)

If I replace the above subselect with the exact list of string_id values from the first query, look at the results:
select translation_id from translations where string_id in 
(1072,1073,[snip: 80 others],197676) and language_id = 2;
126 rows in set (4.94 sec)

In this case, without any further optimization, it's worth issuing two simple queries rather than a single subselect.

Of course, in() cannot compete with an inner join for this type of query:
select translation_id from translations 
inner join strings on strings.string_id = translations.string_id
where translations.language_id = 2 and strings.file_id = 35;
126 rows in set (0.02 sec)

Subselects are not always your friend. Once you've switched to a join, table aliases can cut down on the number of bytes you send to the server, reducing the load on your network stack and making the query easier to read:
select translation_id from translations t 
inner join strings s on s.string_id = t.string_id
where t.language_id = 2 and s.file_id = 35;
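When in doubt about which form wins, ask MySQL for the query plan; EXPLAIN works on any SELECT and shows which indexes each table will use:

```sql
explain select translation_id from translations t
inner join strings s on s.string_id = t.string_id
where t.language_id = 2 and s.file_id = 35;
```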

Next tip: maximizing query cache

Thursday, May 07, 2009

Optimizing SQL: Part 1 - minimizing connections and query counts

The Eclipse website is a pretty busy site, and for the most part, it's well optimized and quite fast.  I'm not talking about Bugzilla or CVS here, only the main site.

I'm a big fan of profiling and optimizing. Perhaps this is because I learned to program on a Timex Sinclair 1000 with 2KB of RAM, where Every Byte Counts!

In this multi-part series, I'll walk through some of the very simple optimizations I've done at Eclipse to make the site as fast as it can be with the hardware we have.  The first installment:

Minimizing connections and query counts

You're likely very familiar with the Pick a Mirror page, the most accessed page on our site. That page needs database information to render, so I tend to keep it lightweight and optimized.

Since I recently added BitTorrent as a download option, I wanted the Pick a Mirror page to show you the Torrent link if the selected file had one.  I created a simple table called torrents_index containing the names of the torrent files, so a simple pattern match between your file and the table would suffice.

But that involves querying the database. Our web servers and database servers are separate machines, so each query and response hits the LAN (generally).

Since the Pick a Mirror PHP script already queries the database for basic information about the file, I simply tapped into an existing query by adding a reference to TOR.torrentname (the additions are the TOR.torrentname column and the torrents_index join):
$sql = "SELECT DRP.drop_id, VAR.key_value, TOR.torrentname
FROM ... /* the FROM clause for the DRP table is elided in this excerpt */
LEFT JOIN SYS_variables AS VAR ON VAR.key_name = 'download_table'
LEFT JOIN torrents_index AS TOR ON (TOR.torrentname LIKE '%${filename_fileonly}.torrent')
WHERE LEFT('$_file', LENGTH(DRP.our_path)) = DRP.our_path
AND LOCATE(DRP.file_pattern, '$_file') > 0";
$rs = mysql_query($sql, $dbh);

In the above query, you'll notice that I'm fetching data from three different tables (DRP, VAR and TOR) using JOINs, even though the tables aren't actually related in any way.

This simple technique of reusing a query is very efficient. It avoids another call to mysql_query(), and saves an entire round-trip to the database server.  Overhead to the existing query is minimal to nil, as observed by timing its execution and showing the query plan with EXPLAIN.

Next up: Optimizing the size of your queries

Tuesday, May 05, 2009

Favorite Linux IDE

My latest copy of Linux Journal (yeah, an actual magazine, printed on paper... with ink), featuring the Readers' Choice Awards 2009, shows Eclipse as the winner of the Favorite Linux IDE award. This quote sums it up:

"In Eclipse, one can work in a lean environment and add and subtract an incredible array of functionality with its myriad modules."

This quote was quite interesting:

"At the same time, the second largest vote-getter was “Other”."

Audacity, Amarok, Adobe AIR and, uh, Dell also got picked as Linux favorites.  Ubuntu got the Distribution of Choice award, which is ironic for me, because I'm getting ready to ditch it and return to my Good Old Fedora.

Monday, May 04, 2009

Eclipse 3.5M7 torrents

I've just put some BitTorrent files up for 3.5M7: