Wednesday, July 2, 2008

Searching for something...

So one of the cooler things about my current job is that I get to work on things that inspire me to keep thinking about them after I leave the office. Over the last couple of weeks I've been thinking about a PHP/MySQL-based site search engine for sites on the order of 100,000 pages and below.

Since the data for these sites is generally kept in a database to begin with, no real crawling is needed the way it is for a web search engine. And since we are looking at the site in a vacuum (we don't care about inbound links), it makes more sense to rank the results by the number of times the search query appears in the page, giving more weight to matches in the title or metadata of the page. Using this method, you can create a search feature for a site with around 1,000 pages using a simple 'LIKE' SQL query. (This code snippet is great for counting the number of times a word shows up in a string in SQL.) Because the database has to read through every page in the site for each search, the searches really slow down once you get past several thousand pages.
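Here is a rough sketch of what that 'LIKE' approach could look like. It is written in Python with SQLite standing in for PHP and MySQL, and the `pages` table, its columns, and the title weight of 5 are all made up for illustration. The occurrence count uses a common trick: remove the term with REPLACE and divide the length that disappeared by the length of the term.

```python
import sqlite3

# Assumption: matches in the title count 5x. The weight is arbitrary.
TITLE_WEIGHT = 5


def count_expr(column):
    # Occurrences of :term in a column =
    # (characters removed by REPLACE) / (length of the term).
    return (f"(LENGTH({column}) - LENGTH(REPLACE(LOWER({column}), :term, '')))"
            f" / LENGTH(:term)")


def like_search(conn, term):
    """Rank pages by weighted occurrences of term, scanning every row."""
    term = term.lower()
    if not term:
        return []
    sql = f"""
        SELECT id, title,
               {TITLE_WEIGHT} * {count_expr('title')} + {count_expr('body')} AS score
        FROM pages
        WHERE LOWER(title) LIKE :pattern OR LOWER(body) LIKE :pattern
        ORDER BY score DESC, id ASC
    """
    # Note: '%' or '_' inside the term would act as LIKE wildcards here.
    return conn.execute(sql, {"term": term, "pattern": f"%{term}%"}).fetchall()


def demo_connection():
    """Build a tiny in-memory site (hypothetical schema and content)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
    conn.executemany(
        "INSERT INTO pages (id, title, body) VALUES (?, ?, ?)",
        [
            (1, "Search tips", "How to search the site."),
            (2, "About us", "We like search, search, and more search."),
            (3, "Contact", "Send us an email."),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in like_search(demo_connection(), "search"):
        print(row)
```

Page 1 outranks page 2 even though page 2 mentions the term more often, because its one title match carries the extra weight. The WHERE clause still has to touch every row, which is exactly why this stops scaling.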

One idea that I came up with to remedy this is to do the same thing the big guys do and index the contents of the site by each word it contains. In an absolute worst case, where all of the 100,000 pages contain all 200,000 common English words, that comes out to about 20 billion records, but if you have a site like that to search, please don't hire me. I haven't been able to compute a more likely case, but I feel like for an average site, with stop words not counted, it would be fair to estimate an index of around 20 to 50 million records (definitely doable for MySQL on standard hardware from what I've read, but I may be wrong). So basically, you have a batch script that every morning (or whenever) reads through the site, and every time it finds a new word in a page it creates a new record that associates the page with the word, then counts the number of times the word shows up and saves that count to help with the result ranking. When you go to search for something, all you have to do is pull up the records for each word in the query and sort the pages by the sum of their word counts.
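A minimal sketch of that word index, again in Python with SQLite standing in for MySQL. The `word_index` table, the `hits` column, the tiny stop word list, and the sample pages are all assumptions for illustration, not a finished design.

```python
import re
import sqlite3
from collections import Counter

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

# Hypothetical site content: page id -> page text.
PAGES = {
    1: "Red apples and green apples",
    2: "Green tea",
    3: "The history of tea: tea, tea, tea",
}


def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower())
            if w not in STOP_WORDS]


def rebuild_index(conn, pages):
    """The batch job: one row per (word, page) with its occurrence count."""
    conn.execute("DROP TABLE IF EXISTS word_index")
    conn.execute(
        "CREATE TABLE word_index ("
        "word TEXT, page_id INTEGER, hits INTEGER, "
        "PRIMARY KEY (word, page_id))"
    )
    rows = [
        (word, page_id, hits)
        for page_id, text in pages.items()
        for word, hits in Counter(tokenize(text)).items()
    ]
    conn.executemany("INSERT INTO word_index VALUES (?, ?, ?)", rows)
    conn.commit()


def search(conn, query):
    """Look up each query word and rank pages by the sum of their counts."""
    words = sorted(set(tokenize(query)))
    if not words:
        return []
    placeholders = ", ".join("?" for _ in words)
    sql = (
        "SELECT page_id, SUM(hits) AS score FROM word_index "
        f"WHERE word IN ({placeholders}) "
        "GROUP BY page_id ORDER BY score DESC, page_id ASC"
    )
    return conn.execute(sql, words).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    rebuild_index(conn, PAGES)
    print(search(conn, "green tea"))
```

The primary key on (word, page_id) lets the lookup hit an index instead of scanning page text. Dropping and rebuilding the whole table on every run is the simplest option for a sketch; an incremental update would probably be worth it on a big site.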

The only problem that comes in is searching by phrases (like when you use quotes in Google search queries). You don't want to index every combination of words because that would be really inefficient and bloat your database, but you can use your word index to shorten your query. With the word index, you only have to search the contents of the pages that contain all the words in the search phrase. This greatly narrows your search, and once it's completed you can cache the phrase and add it to your indexing script so that the engine will learn what phrases to index automatically.
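The narrowing step could look something like this. It's a sketch in plain Python with in-memory dicts instead of database tables, and the function names, sample pages, and cache layout are hypothetical. The cache is just a dict here; feeding cached phrases back into the indexing script is left out.

```python
import re

# Hypothetical site content: page id -> page text.
PAGES = {
    1: "Green tea is popular",
    2: "Tea that is green",
    3: "Black coffee",
}


def words_of(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_word_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = {}
    for page_id, text in pages.items():
        for word in set(words_of(text)):
            index.setdefault(word, set()).add(page_id)
    return index


def phrase_search(phrase, pages, index, cache):
    """Find pages containing the exact phrase, scanning only candidates."""
    words = words_of(phrase)
    if not words:
        return []
    key = " ".join(words)
    if key in cache:
        return cache[key]
    # Only pages that contain every word of the phrase can match it.
    candidates = set.intersection(*(index.get(w, set()) for w in words))
    # Pad with spaces so the phrase only matches on whole-word boundaries.
    needle = f" {key} "
    matches = sorted(
        pid for pid in candidates
        if needle in f" {' '.join(words_of(pages[pid]))} "
    )
    cache[key] = matches  # remember the phrase for next time
    return matches


if __name__ == "__main__":
    index = build_word_index(PAGES)
    cache = {}
    print(phrase_search('"green tea"', PAGES, index, cache))
```

For "green tea", pages 1 and 2 both contain both words, so they are the only ones whose text gets scanned, and only page 1 has the words next to each other in that order.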

I'm hoping to get an implementation of this working sometime this summer, so if anyone has a giant site they need a custom search engine for, let me know!

(man I'm a nerd)
