Peter Norvig's spelling corrector is an interesting example of using some statistical techniques for the very practical purpose of spelling correction, inspired by a conversation on the Google 'Did You Mean' spelling suggestion functionality. There's an excellent explanation of the background in his article, so I'll skim over the ideas and how you might implement them in PHP.
Norvig's solution is to determine the most likely spelling correction by collecting statistics on a a sample of the language. This starts with big file of text he has collected from the Sherlock Holmes books among other places, which you can get from his site as big.txt. This is the corpus from which we will extract our list of 'known' words, and how frequently different words appear. We store each word in an array along with it's frequency count.
<?php
After running that function we have an array of words to word counts. We can work on the basis that a word that is more common is more likely to be a correction than a word that is less common. Of course, we need something else because we don't just want to just correct everything to the most common word, so we consider the edit distance between two words.
This is just a count of the number of letters than need to be inserted, deleted, modified, or moved around within one word to turn it into a second word. One way of defining this is referred to as the Levenshtein distance, for which there is a convenient PHP function. In our case we will be using this to see how many edits are needed to turn the supplied word into a given word from the corpus.
Norvig's method is to consider words with an edit distance of 1 or 2, but to assume words of edit distance 1 are infinitely more likely to be right than words of distance 2. To support this we calculate the edit distance of every word and only keep those that are 1 or 2 away, but then only considering the distance 2 words if there are no distance 1 words available.
<?php
This seems like a pretty good function, and querying doesn't take that long (despite pretty much having to loop through all the words seen). We can use the test words in the original article to evaluate the effectiveness - it's just a replace of : to => and a change of the brackets to turn the python list into a PHP array. We can use that with a simple test function which loops through the words and checks what the correction function comes back with:
<?php
On the 270 test words, our accuracy is 71% with this code, which is slightly worse than Norvig's code at 74%. That may be due to differences in the tokenisation, or ordering of terms with the same frequency count. It runs If the dictionary was precaculated though, then a case of reading it in and performing the corrections could be a very quick task, making a local implementation of a spelling corrector pretty straightforward.
UPDATE: If you're interested in improving the accuracy further, check out Vincenzo's blog post on using soundex as part of the corrector.
Twitter Trackbacks for Spelling Correction - PHP/ir [phpir.com] on Topsy.com
November 1st, 2009 at 20:15
...orrection%20-%20PHP%2Fir%20http%3A%2F%2Ftopsy.com%2Ftb%2Fbit.ly%2F1KAexS" target="_blank">Tweet Spelling Correction - PHP/ir phpir.com/spelling-correction – ...
Vincenzo Russo
November 6th, 2009 at 13:22
I've just slightly modified this to make use also of the Soundex algorithm (in addition to this logic). I've got an accuracy of the 83%.
If we consider that we are not applying stemming and other thing you are perfectly aware of, I reckon that accuracy can be further increased.
Right now I am working (as you know). I'll come back to you with more details (which means code).
Cheers,
V.
Domus Neminis » Spelling correction with Soundex
November 6th, 2009 at 17:36
...respond" title="Comment on Spelling correction with Soundex">Comment! A few days ago Ian Barber an article about the automated spelling correction.
Today I had the time to read it. Good quality and great presentation as always. However, the first...
moleskine » Blog Archive » links for 2009-11-06
November 7th, 2009 at 04:08
...ds/uex.html">Download UEx, UltraEdit text editor for Linux
UEx, UltraEdit text editor for Linux Spelling Correction - PHP/ir
Corrector ortográfico en PHP Write a comment ...
Chris
November 20th, 2009 at 14:03
Peter Norvig gave a talk some time ago at one of the Facebook Tech Talks (viedo here: http://www.facebook.com/video/video.php?v=644326502463). His talk was about spelling correction and language translation using statistical techniques. He also talked about how the larger the corpus is, the better the results, and the need to have diverse text for the corpus. All of Norvig's examples are in Python, but it is still a good video.
Ian Barber
December 1st, 2009 at 23:32
Thanks Chris, video looks good.