Information Retrieval and other interesting topics

Alternative Term Weighting

In: ranking, probability

11 Nov 2009

The term weighting and ranking function is at the core of any information retrieval system. The vector space model with cosine similarity is perhaps the best known and most widely used approach, but there are plenty of alternatives. We're looking at two here: the BM25 function, based on a probabilistic model, and a function based on language modeling.
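As a rough illustration of the first of those alternatives, here is a minimal sketch of BM25 scoring in Python. It assumes documents are already tokenised into lists of terms, uses a commonly seen smoothed IDF variant, and picks k1 = 1.2 and b = 0.75 as typical starting values; the function name and parameters are illustrative, not taken from the post.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document in docs against the query terms.

    k1 and b are tuning parameters; the defaults are assumed, typical values.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates with k1 and is normalised by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [["the", "cat", "sat"], ["the", "dog", "barked"], ["cat", "and", "cat"]]
scores = bm25_scores(["cat"], docs)
```

Documents that do not contain any query term score zero, and repeated occurrences of a term raise the score with diminishing returns.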

Spelling Correction


11 Nov 2009

Peter Norvig's spelling corrector is an interesting example of using some statistical techniques for the very practical purpose of spelling correction, inspired by a conversation about Google's 'Did You Mean' spelling suggestion functionality. There's an excellent explanation of the background in his article, so I'll only skim over the ideas and then look at how you might implement them in PHP.
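To give a flavour of the idea, here is a simplified sketch, written in Python rather than PHP for brevity. It generates every string one edit away from the input and picks the candidate seen most often in some training text. The word counts below are a toy stand-in for a real corpus, and this version only looks at edit distance one, so treat it as an illustration rather than a full corrector.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """Return every string one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)

def correct(word, counts):
    """Pick the most frequent known word within one edit, else return word."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word

# Toy corpus (an assumption for the example); a real one would be far larger.
counts = Counter("the quick brown fox jumps over the lazy dog the end".split())
suggestion = correct("teh", counts)
```

Here "teh" is one transposition away from "the", which is the only known candidate, so that is what gets suggested.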

Simple Collaborative Filtering

In: statistical

10 Oct 2009

Collaborative filtering is a way of trying to present more relevant information to users, by choosing what to show based on how other users have acted. The "You might also like" boxes on Amazon and similar sites are probably the most popular example of this kind of technology, but it's applicable to recommending all kinds of content outside of ecommerce, particularly anything that often has user ratings, such as video or games.
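One simple way to act on other users' behaviour is user-based filtering: find users whose ratings resemble yours, then predict your rating for unseen items as a similarity-weighted average of theirs. The sketch below assumes ratings are held in nested dictionaries and uses cosine similarity. It is one possible approach among several, and the names and data are made up for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two {item: rating} dictionaries."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

def recommend(user, ratings):
    """Rank items the user has not rated by predicted rating, best first."""
    totals, weights = {}, {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        if sim <= 0:
            continue
        for item, rating in their.items():
            if item in ratings[user]:
                continue
            # Accumulate a similarity-weighted sum of the other user's ratings.
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: totals[item] / weights[item] for item in totals}
    return sorted(predicted, key=predicted.get, reverse=True)

# Hypothetical ratings on a 1 to 5 scale.
ratings = {
    "ann": {"a": 5, "b": 4},
    "bob": {"a": 5, "b": 4, "c": 5, "d": 1},
    "cat": {"a": 1, "c": 1, "d": 5},
}
suggestions = recommend("ann", ratings)
```

Because ann's ratings look much more like bob's than cat's, bob's favourite unseen item is ranked ahead of cat's.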

Shingling - Near Duplicate Detection

In: clustering

10 Oct 2009

Determining whether two documents are exactly the same is pretty easy: just use a suitably sized hash and look for a match. However, a document will generally only hash to the same value as another if the two are identical. The smallest change, or the same content on another site with a different header and footer, will cause the hash to be quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one process for detecting these duplicates relatively cheaply.
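As a minimal sketch of the idea, the snippet below breaks each document into overlapping runs of words (shingles) and compares the two sets with the Jaccard coefficient. The shingle size of three words and the whitespace tokenisation are assumptions made for the example; a practical system would usually hash and sample the shingles rather than compare full sets.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    # Texts shorter than size yield a single shingle holding all the words.
    count = max(len(words) - size + 1, 1)
    return {tuple(words[i:i + size]) for i in range(count)}

def resemblance(a, b, size=3):
    """Jaccard coefficient of the two shingle sets, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog"
changed = "the quick brown fox leaps over the lazy dog"
score = resemblance(original, changed)
```

A single changed word would give a completely different hash of the whole document, but here the two texts still share several shingles, so the score falls between 0 and 1 rather than dropping straight to zero.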

Text Classification (And Twitter)

In: classification

09 Sep 2009

Classification techniques are used for spam filters, author identification, intrusion detection and a host of other applications. They can be used to help organise data into a structure, or to add tags to allow users to find documents. While the latest classification algorithms are at the cutting edge of machine learning, there are still thousands of systems using simpler algorithms to great effect.