The term weighting and ranking function is at the core of any information retrieval system. The vector space model with the cosine similarity is maybe the best known and most widely used, but there are plenty of alternatives. We're looking at two here, the BM25 function based around a probabilistic model, and a function based around language modeling.
Just to get something to work with we'll we'll build a quick index on some strings which will stand in for our documents - for a real world usage this structure would be too simple, but it provides the bits the ranking functions need. We build a quick inverted index of term => document ID mappings, and store the count of the number of times each term was seen in each document.
We collect a couple of statistics you wouldn't necessarily always need, such as the total and average number of tokens in the collection, and number in each document, to allow us to do some length related operations. The code below gives us the $collection and $queryTerms arrays, which are used in the two ranking functions.
<?php
$docs = array(
"d1" => "this document is the first document that is quite long",
"d2" => "this is yet another document that is very slightly longer",
"d3" => "this isn't a very interesting string",
"d4" => "this isn't a very interesting document either"
);
$query = 'interesting document';
preg_match_all('/\w+/', $query, $matches);
$queryTerms = $matches[0];
$collection = array('terms' => array(), 'length' => 0, 'documents' => array());
foreach($docs as $docID => $doc) {
preg_match_all('/\w+/', $doc, $matches);
// store the document length
$collection['documents'][$docID] = count($matches[0]);
foreach($matches[0] as $match) {
if(!isset($collection['terms'][$match])) {
$collection['terms'][$match] = array();
}
if(!isset($collection['terms'][$match][$docID])) {
$collection['terms'][$match][$docID] = 0;
}
$collection['terms'][$match][$docID]++;
$collection['length']++;
}
}
$collection['averageLength'] = $collection['length']/count($collection['documents']);
?>
The BM25 model is a probabilistic system. The model is driven by looking at the uncertainty inherent in returning results to a user - are they what they wanted, was it what the user meant by their query. Instead of trying to rank by document similarity to a query, what we really want to rank results by is their probability of being relevant to the user - which of course is a hard figure to estimate (otherwise IR wouldn't be a very interesting field).
As usual we make some simplifying assumptions, particularly an independence assumption. This lets us pretend that the probability of a document being relevant to a user is the same as the product of the probabilities of each of the terms in the document being relevant to the user. We also use the regular Inverse Document Frequency function as an estimate for percentage of documents in the collection that are relevant - a very common term will be in a lot of documents, most of which wont be relevant, and it will have a low IDF that indicates that.
The formula integrates length weighting and smoothing to create a more complex ranking. The algorithm also support some tuning parameters, which I've labelled tfWeight, for weighting the importance of the term frequency part, and a dlWeight for weighting the importance of the document length.
<?php
function bm25Weight($query, $collection, $tfWeight = 1, $dlWeight = 0.5) {
$docScores = array();
$count = count($collection['documents']);
foreach($query as $term) {
$df = count($collection['terms'][$term]);
foreach($collection['terms'][$term] as $docID => $tf) {
$docLength = $collection['documents'][$docID];
$idf = log($count/$df);
$num = ($tfWeight + 1) * $tf;
$denom = $tfWeight
* ((1 - $dlWeight) + $dlWeight *
($docLength / $collection['averageLength']))
+ $tf;
$score = $idf * ($num/$denom);
$docScores[$docID] = isset($docScores[$docID]) ?
$docScores[$docID] + $score : $score;
}
}
return $docScores;
}
?>
We can use this as below:
<?php
$results = bm25Weight($queryTerms, $collection);
arsort($results);
var_dump($results);
?>
This weighting scheme has been very successfully experimentally, and is used as the core scheme in Xapian (with a few more tuning parameters) for example, but there are plenty of extensions. The IDF, as mentioned before, is an approximation to the percentage of the term occurring in non-relevant documents across the collection. If, however, we have used some from of relevance feedback it is easy to integrate that into the above formula and take advantage of the (hopefully improved) estimations.
Languages can be modelled as a sort of finite state machine - a system which is defined as a number of nodes, which represent states, and links between them which indicate valid state transitions. A very simple model might be a machine that emits a single word (from a list of various words and chances of those words appearing), and can loop back to itself or stop. A more complex model might have some sentence structure embedded in it, so that one type of word is more likely to follow another type, but it's the simple case that we are using as part of a query ranking system.
The observation we can take advantage of is that people often search for documents based on what they think is going to be in them - they are trying to write a query that should be found in the document. We can create a one-node language model by looking at a given document and using it to extract the probabilities of different words being emiited. We then can look at all the language models we've generated and determine how likely it is that each would generate the words in the query. The ones most likely to generate the query are therefore ranked highest.
We can determine the probability of each term in the query being generated by the language model by dividing the frequency of the term in the document by the length of the document , maximum likelihood estimation. Of course, the problem is that if a document doesn't have a term in we will get 0 using this formula, so we have to add some smoothing. We can do this by working out a similar MLE value for the whole collection, giving us the function below. Note the tuning parameter of alpha, which weights the smoothing.
<?php
function lmWeight($queryTerms, $collection, $alpha = 0.5) {
$docScores = array();
// get the appropriate documents by merging based on the query terms
$documents = array();
foreach($queryTerms as $term) {
// note that the keys should be strings, or you'll need array_unique
$documents = array_merge($documents, $collection['terms'][$term]);
}
$documents = array_keys($documents);
foreach($queryTerms as $term) {
// Get a smoothing number by looking at probability of this term
$smooth = array_sum($collection['terms'][$term])
/ $collection['length'];
// Calculate a score for each document
foreach($documents as $docID) {
$tf = isset($collection['terms'][$term][$docID]) ?
$collection['terms'][$term][$docID] :
0;
$dl = $collection['documents'][$docID];
$score = ($tf + ($alpha * $smooth))
/ ($dl + $alpha);
$docScores[$docID] = isset($docScores[$docID]) ?
$docScores[$docID] * $score : $score;
}
}
return $docScores;
}
?>
This function can be run as the the BM25 one, and both return the same ranking for this toy case, with document d4 coming out top.
There are a lot of other ways language modelling can be used, including by treating query as the basis of a model as well and comparing how well one model matches the other. It is also possible to introduce other models, such as a translation model to allow mappings between different words with similar meaning, or between different languages, which can make life easier.
Both the models are worth considering when looking at IR systems however, and each comes with it's own body of theoretical justification and motivation that can make some interested (if complicated!) reading.
chango
November 6th, 2009 at 22:53
Nice article! I was wondering how this would scale and work with large cms sites.
Ian Barber’s Blog: Alternative Term Weighting | Webs Developer
November 9th, 2009 at 19:00
... Barber’s Blog: Alternative Term WeightingPosted by web slinger on November - 9 - 2009 In this new post from Ian Barber he takes a look at something that can come in very handy when you need something a ...
Ian Barber’s Blog: Alternative Term Weighting | Development Blog With Code Updates : Developercast.com
November 9th, 2009 at 19:08
...ry/development/" title="View all posts in Development" rel="category tag">Development In this new post from Ian Barber he takes a look at something that can come in very handy when you need something a ...
Chris Henry
November 10th, 2009 at 06:53
Really cool code samples. You ought to take a look at Zend's implementation of Lucene. http://framework.zend.com/manual/en/zend.search.lucene.html
In the end though, PHP doesn't really seem to be the language of choice for a serious IR system. My experience with Zend Lucene was that time to build the index way too large and memory consumption during retrieval was too big to scale well. Sphinx, and the Java implementation of Lucene both performed much better.
Ian Barber
December 1st, 2009 at 23:22
Chris - yeah, it's true. I think that there is the potential in PHP to do decent large scale IR, but it has to be done in a PHP-way. Porting something like Lucene is an amazing effort, and I take my hat off to the ZF guys that did it, but I think that the real strength of a native PHP search engine would be that it could interact with a lot of the stuff that PHP already has support for (memcache, gearman, all sorts) if it had them available.
solomon syf
December 22nd, 2009 at 17:07
can i get the VB version of thise plz
Nate Radebaugh
November 26th, 2011 at 21:53
Excellent article!! Thank you so much for it!
A note: The behavior of array_merge() was modified in PHP 5. http://php.net/manual/en/function.array-merge.php
"If you want to append array elements from the second array to the first array while not overwriting the elements from the first array and not re-indexing, use the + array union operator:
The keys from the first array will be preserved. If an array key exists in both arrays, then the element from the first array will be used and the matching key's element from the second array will be ignored."
Nate Radebaugh
November 26th, 2011 at 21:54
so in the lmWeight() method, using PHP5, you'll need to replace "$documents = array_merge($documents, $collection['terms'][$term]);" with a mere $documents = $documents + $collection['terms'][$term];"