Spelling Correction

· November 1, 2009

Peter Norvig’s spelling corrector is an interesting example of using statistical techniques for the very practical purpose of spelling correction, inspired by a conversation about Google’s ‘Did You Mean’ spelling suggestion functionality. There’s an excellent explanation of the background in his article, so I’ll just skim over the ideas and how you might implement them in PHP.

Norvig’s solution is to determine the most likely spelling correction by collecting statistics on a sample of the language. This starts with a big file of text he has collected from the Sherlock Holmes books, among other places, which you can get from his site as big.txt. This is the corpus from which we will extract our list of ‘known’ words, and how frequently each one appears. We store each word in an array along with its frequency count.

<?php
function train($file = 'big.txt') {
        $contents = file_get_contents($file);
        // get all strings of word letters
        preg_match_all('/\w+/', $contents, $matches);
        unset($contents);
        $dictionary = array();
        foreach($matches[0] as $word) {
                $word = strtolower($word);
                if(!isset($dictionary[$word])) {
                        $dictionary[$word] = 0;
                }
                $dictionary[$word] += 1;
        } 
        unset($matches);
        return $dictionary;
}
?>
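
As a quick sanity check we can build the dictionary and look at it (the variable name $dict here is my own, not from the article):

<?php
// Build the dictionary from the corpus; big.txt must be in the current directory
$dict = train('big.txt');
echo count($dict) . " distinct words\n";
echo "'the' appears " . $dict['the'] . " times\n";
?>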

After running that function we have an array mapping words to word counts. We can work on the basis that a more common word is more likely to be the intended correction than a less common one. Of course, frequency alone isn’t enough, as we don’t want to correct everything to the most common word, so we also consider the edit distance between the two words.

This is just a count of the number of letters that need to be inserted, deleted, or substituted within one word to turn it into a second word. This definition is referred to as the Levenshtein distance, for which there is a convenient built-in PHP function. In our case we will be using it to see how many edits are needed to turn the supplied word into a given word from the corpus.
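
Since levenshtein() ships with PHP, we can check a few distances directly (the example words here are just illustrations):

<?php
echo levenshtein('acess', 'access') . "\n";              // 1: one 'c' inserted
echo levenshtein('acomodation', 'accommodation') . "\n"; // 2: two letters inserted
echo levenshtein('kitten', 'sitting') . "\n";            // 3: two substitutions and an insertion
?>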

Norvig’s method is to consider words within an edit distance of 1 or 2, but to assume words at edit distance 1 are infinitely more likely to be right than words at distance 2. To support this we calculate the edit distance to every word in the dictionary and only keep those that are 1 or 2 away, then consider the distance 2 words only if there are no distance 1 words available.

<?php
function correct($word, $dictionary) {
        $word = strtolower($word);
        if(isset($dictionary[$word])) {
                return $word;
        }

        $edits1 = $edits2 = array();
        foreach($dictionary as $dictWord => $count) {
                $dist = levenshtein($word, $dictWord); 
                if($dist == 1) {
                        $edits1[$dictWord] = $count;
                } else if($dist == 2) {
                        $edits2[$dictWord] = $count;
                }
        }

        if(count($edits1)) {
                arsort($edits1);
                return key($edits1);
        } else if(count($edits2)) {
                arsort($edits2);
                return key($edits2);
        }

        // Nothing better
        return $word;
}
?>
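
To try it out on a single word (assuming the dictionary has been built with the train() function above):

<?php
$dict = train('big.txt');
echo correct('speling', $dict) . "\n"; // should come back as 'spelling'
echo correct('acess', $dict) . "\n";   // should come back as 'access'
?>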

This seems like a pretty good function, and querying doesn’t take that long (despite pretty much having to loop through every word we’ve seen). We can use the test words in the original article to evaluate the effectiveness - it’s just a matter of replacing : with => and changing the brackets to turn the Python dict into a PHP array. We can use that with a simple test function which loops through the words and checks what the correction function comes back with:

<?php
// Snippet rather than all the items. 'correct' => 'wrong wrong wrong'
$test = array('access' => 'acess', 'accessing' => 'accesing', 'accommodation' =>
    'accomodation acommodation acomodation', 'account' => 'acount');
$dict = train(); // build the dictionary from big.txt as above
$good = $bad = 0;

foreach($test as $word => $wrongs) {
        $wrongs = explode(' ', $wrongs); 
        foreach($wrongs as $wrong) {
                if($word == correct($wrong, $dict)) {
                        $good++;
                } else {
                        $bad++;
                }
        }
}
echo "Good: $good\n";
echo "Bad: $bad\n";
echo "Percent: " . round(($good / ($good+$bad)) * 100) . "%\n"
?>

On the 270 test words, our accuracy is 71% with this code, which is slightly worse than Norvig’s code at 74%. That may be due to differences in the tokenisation, or the ordering of terms with the same frequency count. If the dictionary was precalculated though, reading it in and performing the corrections would be a very quick task, making a local implementation of a spelling corrector pretty straightforward.
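
As a rough sketch of that precalculation (the cache filename here is my own choice, not from the article), the trained array could be serialised to disk once and reloaded on later runs:

<?php
$cacheFile = 'dictionary.cache'; // hypothetical cache file name
if(file_exists($cacheFile)) {
        // Reload the previously trained dictionary
        $dict = unserialize(file_get_contents($cacheFile));
} else {
        // Train once and cache the result for next time
        $dict = train('big.txt');
        file_put_contents($cacheFile, serialize($dict));
}
echo correct('acommodation', $dict) . "\n"; // expected: accommodation, if it's in the corpus
?>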

UPDATE: If you’re interested in improving the accuracy further, check out Vincenzo’s blog post on using soundex as part of the corrector.