So far when we've been looking at text we've been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.
An n-gram is just an n-letter-long sequence extracted from a document. For example, the word 'constable' in trigrams (3-letter sequences) breaks down like this: "con", "ons", "nst", "sta", "tab", "abl", "ble".
There are a lot of ways of extracting these, but a reasonably straightforward function is below. It will extract n-grams, defaulting to trigrams, from any string passed in.
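Something along these lines does the job (a minimal sketch - the function name and default are assumptions; for multibyte text you'd want the mb_ string functions instead of substr and strlen):

```php
<?php
// Extract letter n-grams from a string, defaulting to trigrams (n = 3).
// Plain substr() is fine for ASCII; real multibyte text would need
// mb_substr()/mb_strlen() instead.
function ngrams($text, $n = 3) {
    $ngrams = array();
    $len = strlen($text);
    for ($i = 0; $i + $n <= $len; $i++) {
        $ngrams[] = substr($text, $i, $n);
    }
    return $ngrams;
}

print_r(ngrams('constable'));
// con, ons, nst, sta, tab, abl, ble
```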
Looking at words on this level turns out to be a great way of detecting which language a document is written in. There are plenty of algorithms out there, using bigrams and trigrams and various similarity measures, but they all come down to the same idea: build a statistical model of the n-grams that occur in each language, then see which model a text sample most closely matches.
In our example we're going to use trigrams and a vector-space-style cosine similarity. One of the questions that springs up pretty quickly is what to do with spaces - in this case we ignore them, generating trigrams only within words (a word being a continuous sequence of letter characters). We also ignore words less than 3 characters long, just for a bit of consistency.
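That word definition fits in a single regular expression - a quick sketch, with the pattern matching runs of at least three letter characters:

```php
<?php
// Runs of 3+ letter characters; shorter words and anything that isn't
// a letter (spaces, digits, punctuation) are dropped.
preg_match_all('/\pL{3,}/u', strtolower("it's a fair cop"), $matches);
print_r($matches[0]); // fair, cop
```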
Only local weights are being considered - simply the term frequency of each trigram. We might be able to improve accuracy by adding a global component such as IDF to make overly common trigrams less important, but to be honest the language identification is robust enough that it's arguable whether it's worth the bother for this example. If the range of languages were wide enough, it could be a useful improvement for fine-grained separation.
Below is a small class that implements the search. The two key methods are addDocument, which breaks the document down into trigrams and stores their frequencies against a language, and detect, which breaks down a document in the same way and, for each trigram present, compares its frequency with each trained language's. Because we divide by the vector lengths each time, this is a normalised dot product between the two sets of weights, which gives us a score between 0 and 1. We then return the top-scoring language.
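A minimal version might look like the following - only the addDocument/detect interface is fixed by the description above; the internals are a sketch under those assumptions:

```php
<?php
// A sketch of the detector: addDocument() stores per-language trigram
// frequencies, detect() scores a sample against each language with a
// normalised dot product (cosine similarity) and returns the top match.
class LanguageDetector
{
    private $index = array(); // language => (trigram => frequency)

    public function addDocument($language, $document)
    {
        $this->index[$language] = $this->trigrams($document);
    }

    public function detect($document)
    {
        $query = $this->trigrams($document);
        $queryLength = $this->vectorLength($query);
        $bestLanguage = null;
        $bestScore = 0.0;
        foreach ($this->index as $language => $trigrams) {
            $dotProduct = 0.0;
            foreach ($query as $trigram => $frequency) {
                if (isset($trigrams[$trigram])) {
                    $dotProduct += $frequency * $trigrams[$trigram];
                }
            }
            $length = $this->vectorLength($trigrams);
            $score = ($queryLength && $length)
                ? $dotProduct / ($queryLength * $length) : 0.0;
            if ($score > $bestScore) {
                $bestScore = $score;
                $bestLanguage = $language;
            }
        }
        return $bestLanguage;
    }

    // Trigrams per word only - words are runs of 3+ letters, so no
    // trigram ever spans a space. mb_strtolower would be better for
    // accented text, but strtolower keeps the sketch dependency-free.
    private function trigrams($document)
    {
        $trigrams = array();
        preg_match_all('/\pL{3,}/u', strtolower($document), $words);
        foreach ($words[0] as $word) {
            // Split into single (possibly multibyte) characters without
            // needing the mbstring extension.
            $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
            for ($i = 0; $i + 3 <= count($chars); $i++) {
                $trigram = $chars[$i] . $chars[$i + 1] . $chars[$i + 2];
                if (!isset($trigrams[$trigram])) {
                    $trigrams[$trigram] = 0;
                }
                $trigrams[$trigram]++;
            }
        }
        return $trigrams;
    }

    private function vectorLength($trigrams)
    {
        $sum = 0.0;
        foreach ($trigrams as $frequency) {
            $sum += $frequency * $frequency;
        }
        return sqrt($sum);
    }
}

// Toy training data, just to show the shape of the calls.
$detector = new LanguageDetector();
$detector->addDocument('english', 'the quick brown fox jumps over the lazy dog');
$detector->addDocument('dutch', 'de snelle bruine vos springt over de luie hond');
echo $detector->detect('the fox and the dog'); // english
```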
I didn't have a great selection of training material on my laptop, but I did have the multilingual OSX dictionaries, which provided a pretty decent base. A bit of strip_tags to remove the XML cleaned them up, and the functions above did the rest.
With our index built we can then test with various languages to see what kind of results we get. A big thank you to Lorenzo, Soila (who speaks a whole lotta languages) and Ivo for the samples below:
As you can see in the (trimmed) output below, each language is properly detected.
The same thing should even work with whole websites, and to test we can use strip_tags again to remove the HTML. I tried it on the websites of the three Ibuildings local offices:
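strip_tags does the heavy lifting there - it leaves just the visible text for the trigram model, with html_entity_decode handling any character entities. On a made-up fragment of markup:

```php
<?php
// Markup stripped down to plain text, ready to feed to detect().
$html = "<html><body><h1>Benvenuti</h1>\n"
      . "<p>Questa &egrave; la pagina italiana.</p></body></html>";
$text = html_entity_decode(strip_tags($html), ENT_QUOTES, 'UTF-8');
echo trim(preg_replace('/\s+/', ' ', $text)), "\n";
// Benvenuti Questa è la pagina italiana.
```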
It seems there's more English than Dutch text on the NL homepage, so it registers as English, but the Italian homepage picks up correctly.
While the trigrams method is a neat technique, it isn't necessarily the best one to use in every situation. For a method that requires no training, consider simply compiling lists of stop words - extremely common words such as 'the' and 'a' in English - for various languages and looking for them. This can give you effective language detection at low cost, and with a small data overhead.
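The stop-word approach fits in a few lines: count hits against a small per-language word list and take the highest. The lists below are illustrative only, nowhere near exhaustive:

```php
<?php
// No-training detection: whichever language's stop words appear most
// often wins. Real lists would be longer than these samples.
function detectByStopwords($text)
{
    $stopwords = array(
        'english' => array('the', 'a', 'and', 'of', 'to', 'in'),
        'dutch'   => array('de', 'het', 'een', 'en', 'van', 'op'),
        'italian' => array('il', 'la', 'un', 'e', 'di', 'che'),
    );
    preg_match_all('/\pL+/u', strtolower($text), $matches);
    $bestLanguage = null;
    $bestHits = 0;
    foreach ($stopwords as $language => $list) {
        $hits = 0;
        foreach ($matches[0] as $word) {
            if (in_array($word, $list, true)) {
                $hits++;
            }
        }
        if ($hits > $bestHits) {
            $bestHits = $hits;
            $bestLanguage = $language;
        }
    }
    return $bestLanguage;
}

echo detectByStopwords('the cat sat on the mat'); // english
```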
Similarly, looking for Unicode codepoints that only feature in certain languages can give you excellent accuracy, as long as those codepoints do actually appear. There can be some problems around tokenisation with some non-Unicode code pages, which would certainly scupper the trigrams method, but once you have Unicode it becomes easier to handle these issues, and it offers more detection possibilities.
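PCRE's Unicode script properties make this easy to sketch once the text is UTF-8 - if a sample is overwhelmingly Cyrillic or Greek, say, that alone narrows things down considerably. The script list here is just a sample:

```php
<?php
// Count characters per Unicode script and return the most common one.
function dominantScript($text)
{
    $scripts = array('Latin', 'Cyrillic', 'Greek', 'Arabic', 'Han');
    $bestScript = null;
    $bestCount = 0;
    foreach ($scripts as $script) {
        // preg_match_all returns the number of matches found.
        $count = preg_match_all('/\p{' . $script . '}/u', $text, $matches);
        if ($count > $bestCount) {
            $bestCount = $count;
            $bestScript = $script;
        }
    }
    return $bestScript;
}

echo dominantScript('Привет, мир'); // Cyrillic
```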
When discussing this problem with Lorenzo, he mentioned that there is already a language detection PEAR package, though it is still in alpha. It also implements trigram matching, but with a bit more of a framework around it. As expected it's easy to use, has Unicode support, and since it ships with a trigram index you don't even need to train it! Just for completeness, we've tried the same language samples as above:
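Usage looks something like this - presumably the package in question is Text_LanguageDetect, the PEAR language detection package, whose detectSimple method returns the name of the best-matching language:

```php
<?php
// Requires the PEAR Text_LanguageDetect package to be installed.
require_once 'Text/LanguageDetect.php';

$detector = new Text_LanguageDetect();
echo $detector->detectSimple('this is a test of the language detection');
```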
As expected, this outputs the same as our parser. This package can be installed straight from PEAR:
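Assuming the package is Text_LanguageDetect, and since it's still alpha, the install needs the preferred-state suffix:

```shell
pear install Text_LanguageDetect-alpha
```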