Keen to improve the experience for site visitors, we looked at improving the search functionality within the directory.
We started by changing the way we used the full-text search functions of MySQL. These changes do seem to have improved the relevancy of the listings returned for a search, and so have been incorporated into the directory script that powers World Site Index.
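The post doesn't say exactly which changes were made, but one common adjustment is switching MySQL full-text search to boolean mode with a relevance score. A minimal sketch of that idea, with hypothetical table and column names (`listings`, `title`, `description` are assumptions, not the real schema):

```python
# Hypothetical sketch: build a boolean-mode MATCH ... AGAINST query that
# requires every search term (+term*) and orders results by relevance.
# Table and column names are made up for illustration.

def build_fulltext_query(terms):
    """Return a parameterised SQL string plus its bind values."""
    boolean_expr = " ".join(f"+{t}*" for t in terms)
    sql = (
        "SELECT id, title, "
        "MATCH(title, description) AGAINST (%s IN BOOLEAN MODE) AS score "
        "FROM listings "
        "WHERE MATCH(title, description) AGAINST (%s IN BOOLEAN MODE) "
        "ORDER BY score DESC"
    )
    return sql, (boolean_expr, boolean_expr)

sql, params = build_fulltext_query(["web", "directory"])
print(params[0])  # +web* +directory*
```

Boolean mode avoids the default natural-language behaviour of ignoring terms that appear in more than half the rows, which on a small table can make searches return nothing.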
The next step in making information more readily available is to pull the different search options together: at the moment a search of the directory doesn't return results from the blog, and a search within the blog doesn't return pages from the directory. As World Site Index grows and more information is added, this separation of searches will become more of a problem.
We have the option of adding a dropdown list to the search form so visitors can select the area to search within, but this doesn't solve the issue of searching across all areas. We could use the Google API or AdSense for Search, but then we would be at the mercy of waiting for them to spider new content; granted, the delay should only ever be a few days.
The other alternative is to run a search engine on our own servers, indexing our content on demand and possibly injecting updates directly into the index, bypassing the spider. This is obviously more work, but it does seem the better option as it can be bent to our needs.
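The "injecting updates directly into the index" idea can be illustrated with a toy inverted index: when a new listing or blog post is saved, it is added straight to the index with no crawl step. This is only a sketch of the concept, not the engine we would actually build:

```python
from collections import defaultdict

class TinyIndex:
    """Minimal in-memory inverted index. Documents are injected directly
    (e.g. when content is saved), so no spider pass is needed."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.docs = {}                    # document id -> original text

    def add(self, doc_id, text):
        """Index a document immediately, bypassing any crawler."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = self.postings[terms[0]].copy()
        for term in terms[1:]:
            result &= self.postings[term]
        return result

idx = TinyIndex()
idx.add("blog/1", "notes on search and stemming")
idx.add("dir/42", "web directory search listing")
print(idx.search("search"))  # matches pages from both areas in one query
```

One index covering both the blog and the directory is exactly what the dropdown and third-party options above can't give us cleanly.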
The first step in this endeavour is a visit to our favourite search engine to research stemming algorithms, anything previously written on search, and any ready-to-go engines.
Stemming is the process of removing endings from English words to find the stem, or root form. This results in a smaller dictionary and allows recognition of potentially similar words. [See: The Porter Stemming Algorithm].
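To show the idea in miniature: the sketch below strips a handful of common suffixes. The real Porter algorithm applies ordered rule sets with conditions on the remaining stem; this crude version only captures the flavour of it:

```python
def crude_stem(word):
    """Strip a few common English suffixes to approximate a stem.
    A deliberately simplified illustration -- not the Porter algorithm,
    which uses ordered rules and a 'measure' of the remaining stem."""
    for suffix in ("ingly", "edly", "ing", "ed", "es", "s"):
        # Require at least 3 characters left so 'was' doesn't become 'wa'... etc.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(crude_stem("jumping"))   # jump
print(crude_stem("searches"))  # search
```

With stems as index terms, a search for "index" can match pages containing "indexes" or "indexing", shrinking the dictionary at the same time.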
As we need to collect information from the different parts that make up World Site Index, and not all of the code is written by us, a simple bot will be needed. At first glance this seems simple enough; however, you quickly realise that it will need to fetch robots.txt where available, and check for and obey the meta robots tag within any pages it retrieves.
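The two politeness checks can be sketched with Python's standard library. A real bot would fetch robots.txt over HTTP; here canned text is parsed for illustration, and the agent name `wsi-bot` is made up:

```python
import re
from urllib.robotparser import RobotFileParser

# Check 1: obey robots.txt (parsed here from canned lines for illustration).
rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

def may_fetch(url, agent="wsi-bot"):  # agent name is a placeholder
    return rp.can_fetch(agent, url)

# Check 2: obey a noindex directive in the page's meta robots tag.
# A simple regex sketch; a real bot would use a proper HTML parser.
META_ROBOTS = re.compile(
    r'<meta\s+name=["\']robots["\']\s+content=["\']([^"\']+)["\']', re.I)

def may_index(html):
    m = META_ROBOTS.search(html)
    return m is None or "noindex" not in m.group(1).lower()

print(may_fetch("http://example.com/private/page.html"))           # False
print(may_index('<meta name="robots" content="noindex,follow">'))  # False
```

Note that the two checks operate at different stages: robots.txt decides whether to request a page at all, while the meta tag can only be seen after the page has been retrieved.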
The simple bot just got a whole lot more complex, especially when you add caching of robots.txt to save repeated hits, and locking so that only one spider at a time can index an area. We may wish to index other sites or portions of sites, and we wouldn't be popular if many bots hit the same site at once.
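Those two extra pieces might look roughly like the sketch below: a time-to-live cache for robots.txt and a per-host lock. The names and the one-hour TTL are assumptions for illustration:

```python
import threading
import time

ROBOTS_TTL = 3600  # assumed: re-fetch robots.txt after an hour

_robots_cache = {}  # host -> (fetched_at, robots_text)
_host_locks = {}    # host -> threading.Lock
_locks_guard = threading.Lock()

def cached_robots(host, fetch):
    """Return robots.txt for a host, calling fetch() only when stale."""
    now = time.time()
    entry = _robots_cache.get(host)
    if entry is None or now - entry[0] > ROBOTS_TTL:
        _robots_cache[host] = (now, fetch(host))
    return _robots_cache[host][1]

def host_lock(host):
    """One lock per host, so concurrent spiders never overlap on a site."""
    with _locks_guard:
        return _host_locks.setdefault(host, threading.Lock())

# The second lookup is served from the cache -- only one real fetch happens.
calls = []
cached_robots("example.com", lambda h: calls.append(h) or "User-agent: *")
cached_robots("example.com", lambda h: calls.append(h) or "User-agent: *")
print(len(calls))  # 1
```

A spider would take `host_lock(host)` before crawling an area and release it when done, so two bots never hammer the same site simultaneously.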
You might be wondering why we don't just use one of the free search engines available on the net, e.g. phpDig or Sphider. Well, apart from the fact that there's no fun in that, building our own allows us to integrate and use the code in any way we wish. It also gives us a complete understanding of how the engine works.