Wednesday, September 28, 2005

Better search

The site's search engine gets its fair share of use; unfortunately, the results don't always seem relevant. The search engine itself, although not a Google, is not a bad piece of software in its own right. It simply has a hard time indexing the entire site with the current configuration. The main problems are:

- There are a bazillion pages on the site, and the indexer is not configured to rank them properly. Currently, the body of a page has (by far) the most weight, and the URL, title and META tags have little to none. With this configuration, a mail archive page containing the word "SWT" 10 times seems more relevant than the SWT home page when simply searching for "SWT".
- The and Infocenters ( sections are currently not indexed, yet those two servers contain valuable information
- The sheer quantity of mail and news archive pages is astonishing (300,000 and counting). Because of the poor ranking configuration and the natural recurrence of rich keywords within each page, these documents often get top ranks, yet they mean little to the crowd actually using the search engine.

To achieve better search results, I've installed the latest version of mnogo, and I've tuned it for the content:

- the title has the most weight, followed by META tags (keywords, description)
- the body comes in next, but much lower
- the URL now plays an important part - a page with "birt" in the URL will rank higher when searching for birt
- page headings, using h1, are now considered as well
- default search will exclude mail and news archives. Most searches are done using generic terms, so those folks looking for a snippet of code buried deep in an archive will know to use an extended search
- a more user-friendly scope for searching: website only, downloads, documentation, archives
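
To make the weighting idea concrete, here is a minimal Python sketch of per-section ranking. It is not mnogo configuration and not how the indexer works internally: the weight numbers, the Page fields, the sample URLs and the page text are all made-up assumptions, chosen only to show why a body-heavy setup favors a keyword-stuffed archive page and why the tuned setup does not.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    title: str
    meta: str
    h1: str
    body: str
    scope: str  # "website", "downloads", "documentation" or "archives"


# Hypothetical weights, for illustration only (not the real indexer settings).
BODY_HEAVY = {"title": 0.0, "meta": 0.0, "url": 0.0, "h1": 0.0, "body": 1.0}
TUNED = {"title": 8.0, "meta": 6.0, "url": 5.0, "h1": 4.0, "body": 1.0}


def count(term, text):
    """Case-insensitive count of term occurrences in text."""
    return text.lower().count(term.lower())


def score(page, term, weights):
    """Sum of (section weight * occurrences of term in that section)."""
    return sum(w * count(term, getattr(page, section))
               for section, w in weights.items())


def search(pages, term, weights, include_archives=False):
    """Rank matching pages; archives are left out unless asked for."""
    pool = [p for p in pages if include_archives or p.scope != "archives"]
    hits = [p for p in pool if score(p, term, weights) > 0]
    return sorted(hits, key=lambda p: score(p, term, weights), reverse=True)


# Made-up sample pages.
HOME = Page(url="/swt/", title="SWT: The Standard Widget Toolkit",
            meta="swt, widgets", h1="SWT",
            body="SWT is a widget toolkit for Java.", scope="website")
MAIL = Page(url="/lists/msg00001.html", title="Re: table redraw question",
            meta="", h1="", body="SWT " * 10, scope="archives")
PAGES = [HOME, MAIL]

if __name__ == "__main__":
    for name, weights in (("body-heavy", BODY_HEAVY), ("tuned", TUNED)):
        ranked = search(PAGES, "SWT", weights, include_archives=True)
        print(name, [p.url for p in ranked])
    print("default scope", [p.url for p in search(PAGES, "SWT", TUNED)])
```

With the body-heavy weights the archive message ranks first; with the tuned weights the home page does, and the default search skips the archive entirely.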

The new search engine is currently indexing the site. It should take another day or two to complete. I'll post details when I'm ready for you to give it a test run.


Anonymous Vineet said...

I know this is an additional project, but as an Eclipse plug-in developer it would be nice to have the entire source code indexed as well, possibly with something like lxr.

6:31 AM  
