Alternative Term Weighting

· November 6, 2009

The term weighting and ranking function is at the core of any information retrieval system. The vector space model with cosine similarity is perhaps the best known and most widely used, but there are plenty of alternatives. We're looking at two here: the BM25 function, based around a probabilistic model, and a function based around language modelling.

Just to get something to work with, we'll build a quick index on some strings which will stand in for our documents - for real world usage this structure would be too simple, but it provides the bits the ranking functions need. We build a quick inverted index of term => document ID mappings, and store a count of the number of times each term was seen in each document.

We collect a couple of statistics you wouldn't necessarily always need, such as the total and average number of tokens in the collection, and the number in each document, to allow us to do some length related operations. The code below gives us the $collection and $queryTerms arrays, which are used in the two ranking functions.

<?php
        $docs = array(
                "d1" => "this document is the first document that is quite long",
                "d2" => "this is yet another document that is very slightly longer",
                "d3" => "this isn't a very interesting string",
                "d4" => "this isn't a very interesting document either"
        );
        $query = 'interesting document';
        preg_match_all('/\w+/', $query, $matches);
        $queryTerms = $matches[0];

        $collection = array('terms' => array(), 'length' => 0, 'documents' => array());
        foreach($docs as $docID => $doc) {
                preg_match_all('/\w+/', $doc, $matches);
                // store the document length
                $collection['documents'][$docID] = count($matches[0]);

                foreach($matches[0] as $match) {
                        if(!isset($collection['terms'][$match])) {
                                $collection['terms'][$match] = array();
                        }
                        if(!isset($collection['terms'][$match][$docID])) {
                                $collection['terms'][$match][$docID] = 0;
                        }
                        $collection['terms'][$match][$docID]++;
                        $collection['length']++;
                }        
        }
        $collection['averageLength'] = $collection['length']/count($collection['documents']);
?>
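
For the toy documents above, the resulting index ends up looking like this for the two query terms (note that the simple \w+ regex splits "isn't" into the tokens "isn" and "t", which is why d3 and d4 come out at 7 and 8 tokens):

        $collection['terms']['document']    => array('d1' => 2, 'd2' => 1, 'd4' => 1)
        $collection['terms']['interesting'] => array('d3' => 1, 'd4' => 1)
        $collection['documents']            => array('d1' => 10, 'd2' => 10, 'd3' => 7, 'd4' => 8)
        $collection['length']               => 35
        $collection['averageLength']        => 8.75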

Okapi/BM25

The BM25 model is a probabilistic system. The model is driven by the uncertainty inherent in returning results to a user - are they what the user wanted, and are they what the user meant by their query? Instead of trying to rank by document similarity to a query, what we really want to rank results by is their probability of being relevant to the user - which is of course a hard figure to estimate (otherwise IR wouldn't be a very interesting field).

As usual we make some simplifying assumptions, particularly an independence assumption. This lets us pretend that the probability of a document being relevant to a user is the same as the product of the probabilities of each of the terms in the document being relevant to the user. We also use the regular Inverse Document Frequency function as an estimate of the proportion of documents in the collection that are relevant - a very common term will be in a lot of documents, most of which won't be relevant, and it will have a low IDF that indicates that.
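
In the toy collection, for instance, 'document' appears in three of the four documents, giving it an IDF of ln(4/3) ≈ 0.29, while 'interesting' appears in only two, for an IDF of ln(4/2) ≈ 0.69 - so a match on 'interesting' is worth more to the score.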

The formula integrates length weighting and smoothing to create a more complex ranking. The algorithm also supports some tuning parameters, which I've labelled tfWeight, for weighting the importance of the term frequency part, and dlWeight, for weighting the importance of the document length.
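
In the usual notation, with k1 playing the role of tfWeight and b of dlWeight, the score the function below assigns to a document d for a query q works out as:

$$\mathrm{score}(d,q) = \sum_{t \in q} \ln\!\left(\frac{N}{df_t}\right) \cdot \frac{(k_1 + 1)\, tf_{t,d}}{k_1\left((1-b) + b\,\frac{|d|}{avgdl}\right) + tf_{t,d}}$$

where N is the number of documents in the collection, df_t the number containing the term, tf_{t,d} the term's count in the document, |d| the document length and avgdl the average document length.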

<?php
        function bm25Weight($query, $collection, $tfWeight = 1, $dlWeight = 0.5) {
                $docScores = array();
                $count = count($collection['documents']);
                foreach($query as $term) {
                        // skip query terms that don't appear in any document
                        if(!isset($collection['terms'][$term])) {
                                continue;
                        }
                        // document frequency: the number of documents containing the term
                        $df = count($collection['terms'][$term]);
                        foreach($collection['terms'][$term] as $docID => $tf) {
                                $docLength = $collection['documents'][$docID];
                                $idf = log($count/$df);
                                $num = ($tfWeight + 1) * $tf;
                                $denom = $tfWeight
                                        * ((1 - $dlWeight) + $dlWeight * 
                                                ($docLength / $collection['averageLength']))
                                        + $tf;
                                $score = $idf * ($num/$denom);
                                $docScores[$docID] = isset($docScores[$docID]) ? 
                                                        $docScores[$docID] + $score : $score;
                        }
                }
                return $docScores;
        }
?>

We can use this as below:

<?php
        $results = bm25Weight($queryTerms, $collection);
        arsort($results);
        var_dump($results);
?>

This weighting scheme has been very successful experimentally, and is used as the core scheme in Xapian (with a few more tuning parameters) for example, but there are plenty of extensions. The IDF, as mentioned before, is an approximation to how often the term occurs in non-relevant documents across the collection. If, however, we have used some form of relevance feedback it is easy to integrate that into the above formula and take advantage of the (hopefully improved) estimations.
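
As a sketch of how that could look (the function below isn't part of the original code, and the relevance counts would have to come from somewhere - user judgements, click data and so on), the Robertson/Sparck Jones relevance weight can replace the plain IDF once we know how many documents are relevant and how many of those contain the term:

<?php
        // A sketch of the Robertson/Sparck Jones relevance weight, which could
        // stand in for the plain IDF when relevance feedback is available.
        // $docCount    - total number of documents in the collection
        // $df          - number of documents containing the term
        // $relCount    - number of documents judged relevant
        // $relWithTerm - number of relevant documents containing the term
        function relevanceWeight($docCount, $df, $relCount, $relWithTerm) {
                // the 0.5s smooth away zero counts
                $relPart = ($relWithTerm + 0.5) / ($relCount - $relWithTerm + 0.5);
                $nonRelPart = ($df - $relWithTerm + 0.5)
                        / ($docCount - $df - $relCount + $relWithTerm + 0.5);
                return log($relPart / $nonRelPart);
        }
?>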

Language Modelling

Languages can be modelled as a sort of finite state machine - a system which is defined as a number of nodes, which represent states, and links between them which indicate valid state transitions. A very simple model might be a machine that emits a single word (from a list of various words and chances of those words appearing), and can loop back to itself or stop. A more complex model might have some sentence structure embedded in it, so that one type of word is more likely to follow another type, but it’s the simple case that we are using as part of a query ranking system.

The observation we can take advantage of is that people often search for documents based on what they think is going to be in them - they are trying to write a query that should be found in the document. We can create a one-node language model by looking at a given document and using it to extract the probabilities of different words being emitted. We can then look at all the language models we've generated and determine how likely it is that each would generate the words in the query. The ones most likely to generate the query are therefore ranked highest.
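
As a quick illustration (this helper isn't part of the ranking code below, just a way of seeing the model), a per-document model is really nothing more than the term counts normalised by the document length:

<?php
        // A sketch of a per-document unigram model, built on the $collection
        // structure from earlier: the probability of each term is simply its
        // count in the document divided by the document length.
        function buildModel($docID, $collection) {
                $model = array();
                $docLength = $collection['documents'][$docID];
                foreach($collection['terms'] as $term => $postings) {
                        if(isset($postings[$docID])) {
                                $model[$term] = $postings[$docID] / $docLength;
                        }
                }
                return $model;
        }
        // e.g. buildModel('d4', $collection) gives 'document' a probability of 1/8
?>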

We can determine the probability of each term in the query being generated by the language model by dividing the frequency of the term in the document by the length of the document - the maximum likelihood estimate. Of course, the problem is that if a document doesn't contain a term we will get 0 from this formula, so we have to add some smoothing. We can do this by working out a similar MLE value for the whole collection, giving us the function below. Note the tuning parameter alpha, which weights the smoothing.
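
Putting that together, the score the function below computes for a document d against query q is:

$$P(q \mid d) \approx \prod_{t \in q} \frac{tf_{t,d} + \alpha \cdot P(t \mid C)}{|d| + \alpha} \qquad \text{where} \qquad P(t \mid C) = \frac{cf_t}{\text{collection length}}$$

with cf_t being the total number of times the term appears anywhere in the collection.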

<?php
        function lmWeight($queryTerms, $collection, $alpha = 0.5) {
                $docScores = array();
                // get the appropriate documents by merging based on the query terms
                $documents = array();
                foreach($queryTerms as $term) {
                        // skip query terms that don't appear in any document
                        if(!isset($collection['terms'][$term])) {
                                continue;
                        }
                        // note that the document IDs are string keys, so array_merge
                        // preserves them - with numeric keys you'd need array_unique
                        $documents = array_merge($documents, $collection['terms'][$term]);
                }

                $documents = array_keys($documents);
                foreach($queryTerms as $term) {
                        // skip terms that aren't in the collection at all
                        if(!isset($collection['terms'][$term])) {
                                continue;
                        }
                        // Get a smoothing value: the probability of this term across
                        // the whole collection (total occurrences / collection length)
                        $smooth = array_sum($collection['terms'][$term]) 
                                                / $collection['length'];

                        // Calculate a score for each document
                        foreach($documents  as $docID) {
                                $tf = isset($collection['terms'][$term][$docID]) ?
                                        $collection['terms'][$term][$docID] :
                                        0;
                                $dl = $collection['documents'][$docID];
                                $score = ($tf + ($alpha * $smooth))
                                                / ($dl + $alpha);
                                $docScores[$docID] = isset($docScores[$docID]) ? 
                                                $docScores[$docID] * $score : $score;
                        }
                }
                return $docScores;
        }
?>

This function can be run in the same way as the BM25 one, and both return the same ranking for this toy case, with document d4 coming out top:
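
<?php
        $results = lmWeight($queryTerms, $collection);
        arsort($results);
        var_dump($results);
?>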

There are a lot of other ways language modelling can be used, including treating the query as the basis of a model as well, and comparing how well one model matches the other. It is also possible to introduce other models, such as a translation model to allow mappings between different words with similar meaning, or between different languages, which can make life easier.

Both models are worth considering when looking at IR systems, however, and each comes with its own body of theoretical justification and motivation that can make for some interesting (if complicated!) reading.