Language Detection With N-Grams

November 11, 2009

So far when we’ve been looking at text we’ve been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.

An n-gram is just an n-letter sequence extracted from a document. For example, the word ‘constable’ breaks down into these trigrams (3-letter sequences): “con”, “ons”, “nst”, “sta”, “tab”, “abl”, “ble”.

There are a lot of ways of extracting these, but a reasonably straightforward function is below. It extracts n-grams, defaulting to trigrams, from any string passed in.

<?php
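function getNgrams($word, $n = 3) {
        $ngrams = array();
        // Note: strlen() and substr() count bytes, so multibyte
        // (e.g. UTF-8) characters are split across n-grams.
        $len = strlen($word);
        // Slide an $n-character window along the word, one position
        // at a time. Words shorter than $n produce no n-grams.
        for($i = 0; $i <= $len - $n; $i++) {
                $ngrams[] = substr($word, $i, $n);
        }
        return $ngrams;
}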
?>
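
As a quick check, the snippet below runs the function on the word from the earlier example, and also asks for bigrams. The function is repeated in compact form inside a function_exists guard only so that the snippet runs on its own; everything else is just illustration.

```php
<?php
// Repeated here only so the snippet is self-contained;
// skipped if getNgrams() has already been defined.
if (!function_exists('getNgrams')) {
    function getNgrams($word, $n = 3) {
        $ngrams = array();
        for ($i = 0; $i + $n <= strlen($word); $i++) {
            $ngrams[] = substr($word, $i, $n);
        }
        return $ngrams;
    }
}

echo implode(' ', getNgrams('constable')), "\n";    // con ons nst sta tab abl ble
echo implode(' ', getNgrams('constable', 2)), "\n"; // co on ns st ta ab bl le
echo count(getNgrams('to')), "\n";                  // 0, the word is shorter than n
```

Words shorter than n simply produce an empty array, which is what lets the detector skip short words without a separate check.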

Language Detection

Looking at words on this level turns out to be a great way of detecting which language a document is written in. There are plenty of algorithms out there, using bigrams and trigrams and different similarity measures, but they all come down to the same idea: collect a statistical model of the n-grams found in each language, then see which one a text sample most closely matches.

In our example we’re going to use trigrams, and a vector space style cosine similarity. One of the questions that springs up pretty quickly is what to do with spaces. In this case we are ignoring them, generating trigrams only within each word (defined as a continuous run of word characters, i.e. whatever \w+ matches). We also ignore words less than 3 characters long, just for a bit of consistency.

Only local weights are being considered: simply the term frequency of the trigram. We might be able to improve accuracy by adding a global component such as idf to make overly common trigrams less important, but to be honest the language identification is robust enough that it’s arguable whether it’s worth the bother for this example. If the range of languages were wide enough, it might be a useful improvement for getting fine-grained separation.
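
To make the idea of a global weight a bit more concrete, here is a minimal sketch of one possible idf calculation, assuming trigram counts are stored per language as trigram => language => count. The function name trigramIdf and the sample counts are made up for illustration, and log(N / df) is just one common variant of idf; none of this is used by the detector in this post.

```php
<?php
// Hypothetical helper: idf for each trigram, given an index shaped like
// array(trigram => array(language => count)) and the number of languages.
function trigramIdf(array $index, $languageCount) {
    $idf = array();
    foreach ($index as $trigram => $perLanguage) {
        // "Document frequency" here is the number of languages
        // in which the trigram appears at least once.
        $idf[$trigram] = log($languageCount / count($perLanguage));
    }
    return $idf;
}

// Made-up counts, purely for illustration.
$index = array(
    'the' => array('english' => 120, 'dutch' => 4),
    'ing' => array('english' => 90, 'dutch' => 35, 'swedish' => 20),
    'ijk' => array('dutch' => 60),
);
$idf = trigramIdf($index, 3);
printf("ing: %.3f, the: %.3f, ijk: %.3f\n", $idf['ing'], $idf['the'], $idf['ijk']);
// prints: ing: 0.000, the: 0.405, ijk: 1.099
```

A trigram that turns up in every language gets a weight of 0, so multiplying each trigram's contribution by its idf would discount the sequences that languages share and emphasise the distinctive ones.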

Below is a small class that implements the detection. The two key methods are addDocument, which breaks the document down into trigrams and stores their frequencies against a language, and detect, which breaks a document down in the same way and, for each trigram present, compares its frequency with that of each trained language. Because each count is divided by the total number of trigrams (in the language and in the document respectively), the score is a dot product of two sets of relative frequencies, which falls between 0 and 1. Strictly speaking, cosine similarity would divide by the Euclidean length of each vector rather than by the total count, so this is a simplified version of the same idea. We then return the top scoring language.

<?php
class LangDetector {
        private $index = array();
        private $languages = array();

        public function addDocument($document, $language) {
                if(!isset($this->languages[$language])) {
                        $this->languages[$language] = 0;
                }

                $words = $this->getWords($document);
                foreach($words as $match) {
                        $trigrams = $this->getNgrams($match);
                        foreach($trigrams as $trigram) {
                                if(!isset($this->index[$trigram])) {
                                        $this->index[$trigram] = array();
                                }
                                if(!isset($this->index[$trigram][$language])) {
                                        $this->index[$trigram][$language] = 0;
                                }
                                $this->index[$trigram][$language]++;
                        }
                        $this->languages[$language] += count($trigrams);
                }
        }

        public function detect($document) {
                $words = $this->getWords($document);
                $trigrams = array();
                foreach($words as $word) {
                        foreach($this->getNgrams($word) as $trigram) {
                                if(!isset($trigrams[$trigram])) {
                                        $trigrams[$trigram] = 0;
                                }
                                $trigrams[$trigram]++;
                        }
                }
                $total = array_sum($trigrams);

                $scores = array();
                foreach($trigrams as $trigram => $count) {
                        if(!isset($this->index[$trigram])) {
                                continue;
                        }
                        foreach($this->index[$trigram] as $language => $lCount) {
                                if(!isset($scores[$language])) {
                                        $scores[$language] = 0;
                                }
                                $score = ($lCount / $this->languages[$language]) 
                                                        * ($count / $total);
                                $scores[$language] += $score;
                        }
                }
                arsort($scores);
                return key($scores);
        }

        private function getWords($document) {
                $document = strtolower($document);
                preg_match_all('/\w+/', $document, $matches);
                return $matches[0];
        }

        private function getNgrams($match, $n = 3) {
                $ngrams = array();
                // strlen() and substr() count bytes, not characters.
                $len = strlen($match);
                // Slide an $n-character window along the word; words
                // shorter than $n produce no n-grams.
                for($i = 0; $i <= $len - $n; $i++) {
                        $ngrams[] = substr($match, $i, $n);
                }
                return $ngrams;
        }
}
?>
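
For comparison with the scoring in detect, which divides each count by a total count, here is a sketch of a strict cosine similarity, which divides the dot product by the Euclidean length of each vector instead. The function name cosineSimilarity and the sample counts are invented for illustration and are not part of the class above.

```php
<?php
// Hypothetical helper: cosine similarity between two sparse vectors,
// each given as array(trigram => weight).
function cosineSimilarity(array $a, array $b) {
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }
    $square = function ($w) { return $w * $w; };
    $lengthA = sqrt(array_sum(array_map($square, $a)));
    $lengthB = sqrt(array_sum(array_map($square, $b)));
    if ($lengthA == 0 || $lengthB == 0) {
        return 0.0; // an empty vector matches nothing
    }
    return $dot / ($lengthA * $lengthB);
}

// Made-up trigram counts, purely for illustration.
$sample    = array('the' => 2, 'ing' => 1);
$sameShape = array('the' => 4, 'ing' => 2); // same proportions, larger counts
$disjoint  = array('ijk' => 3);             // no trigrams in common

printf("%.2f\n", cosineSimilarity($sample, $sameShape)); // 1.00
printf("%.2f\n", cosineSimilarity($sample, $disjoint));  // 0.00
```

Either normalisation stops a language with a huge training text from winning just because its raw counts are bigger; the cosine form has the extra property that two vectors with identical proportions always score 1.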

I didn’t have a great selection of training material on my laptop, but I did have the multilingual OS X dictionaries, which provided a pretty decent base. A bit of strip_tags cleaned out the XML, and the functions above do the rest.

<?php
$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";
$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->addDocument($dutch, 'dutch');
$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->addDocument($english, 'english');
$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->addDocument($finnish, 'finnish');
$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->addDocument($spanish, 'spanish');
$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->addDocument($italian, 'italian');
$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->addDocument($french, 'french');
$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->addDocument($swedish, 'swedish');
?>

With our index built we can then test with various languages to see what kind of results we get. A big thank you to Lorenzo, Soila (who speaks a whole lotta languages) and Ivo for the samples below:

<?php
$italian = "
Nel mezzo del cammin di nostra vita
 mi ritrovai per una selva oscura
 ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";

$finnish = "
Suomalainen on sellainen, joka vastaa kun ei kysytä,
kysyy kun ei vastata, ei vastaa kun kysytään,
sellainen, joka eksyy tieltä, huutaa rannalla
ja vastarannalla huutaa toinen samanlainen.
";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";

$dutch = "
zoals het klokje thuis tikt, tikt het nergens
";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";

$spanish = "
Por qué los inmensos aviones
No se pasean con sus hijos?
Cuál es el pájaro amarillo
Que llena el nido de limones?
Por qué no enseñan a sacar
Miel del sol a los helicópteros?
";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";

$swedish = "
Och knyttet tog av skorna och suckade och sa:
hur kan det kännas sorgesamt fast allting är så bra?
Men vem ska trösta knyttet med att säga: lilla vän,
vad gör man med en snäcka om man ej får visa den?
";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>

As you can see in the (trimmed) output below, each language is properly detected.

Nel mezzo del cammin...
is italian

Suomalainen on sellainen...
is finnish

zoals het klokje thuis tikt, tikt het nergens
is dutch

Por qué los inmensos...
is spanish

Och knyttet tog av...
is swedish

The same thing should even work with whole websites, and to test we can use strip_tags again to remove the HTML. I tried it on the websites of the three Ibuildings local offices:

<?php
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: " . $lang->detect($nl), "\n";

$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: " . $lang->detect($uk), "\n";

$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: " . $lang->detect($it), "\n";
?>

It seems there’s more English than Dutch text on the NL homepage, so it registers as English, but the Italian homepage picks up correctly.

IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian

Other Methods

While the trigram method is a neat fix, it isn’t necessarily the best technique to use in every situation. For a method that requires no training, consider compiling a list of stop words for each language (extremely common words, such as ‘the’ and ‘a’ in English) and looking for them in the text. This can give you effective language detection at low cost, and with a small data overhead.
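
A minimal sketch of the stop word idea might look like the following. The function name detectByStopWords and the tiny word lists are assumptions made up for this example; real lists would be far longer, and strtolower is only good enough here because the sample stop words are plain ASCII.

```php
<?php
// Hypothetical helper: pick the language whose stop word list
// matches the most words in $text, or null if nothing matches.
function detectByStopWords($text, array $stopWords) {
    // \pL+ with the u flag matches runs of letters in UTF-8 text.
    preg_match_all('/\pL+/u', strtolower($text), $matches);
    $words = $matches[0];
    $scores = array();
    foreach ($stopWords as $language => $list) {
        // array_intersect keeps repeated words, so a stop word
        // that occurs twice in the text counts twice.
        $scores[$language] = count(array_intersect($words, $list));
    }
    arsort($scores);
    return reset($scores) > 0 ? key($scores) : null;
}

// Tiny hand-picked lists, for illustration only.
$stopWords = array(
    'english' => array('the', 'and', 'of', 'is', 'a'),
    'dutch'   => array('het', 'de', 'een', 'en', 'van'),
    'italian' => array('di', 'la', 'il', 'che', 'per'),
);

echo detectByStopWords("zoals het klokje thuis tikt, tikt het nergens", $stopWords), "\n"; // dutch
echo detectByStopWords("Nel mezzo del cammin di nostra vita", $stopWords), "\n";           // italian
```

With lists this short a sample can easily contain no stop words at all, which is why the sketch returns null rather than guessing.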

Similarly, looking for Unicode code points that only feature in certain languages can give you excellent accuracy, as long as those code points actually appear. Tokenisation can be a problem with some non-Unicode code pages, which would certainly scupper the trigram method, but once you have Unicode, certain issues become easier to handle and more detection possibilities open up.
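
As a rough sketch of the code point approach, PCRE's Unicode script properties can report which writing systems appear in a UTF-8 string. The function name scriptsIn and the choice of scripts are assumptions for this example, and a script usually narrows the candidates rather than naming a single language (Cyrillic, for instance, is shared by several).

```php
<?php
// Hypothetical helper: report which of a few Unicode scripts
// appear in a UTF-8 string.
function scriptsIn($text) {
    $scripts = array('Cyrillic', 'Greek', 'Hebrew', 'Arabic', 'Hiragana', 'Hangul');
    $found = array();
    foreach ($scripts as $script) {
        // \p{...} matches any code point in the named script;
        // the u flag makes PCRE treat the subject as UTF-8.
        if (preg_match('/\p{' . $script . '}/u', $text) === 1) {
            $found[] = $script;
        }
    }
    return $found;
}

echo implode(', ', scriptsIn("Привет, мир")), "\n"; // Cyrillic
echo implode(', ', scriptsIn("καλημέρα")), "\n";    // Greek
echo count(scriptsIn("hello world")), "\n";         // 0, none of the listed scripts
```

For Latin-script languages this says nothing useful on its own, so it works best as a first pass in front of something like the trigram detector.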

Text_LanguageDetect

When I discussed this problem with Lorenzo, he mentioned that there is already a language detection PEAR package, though it is still in alpha. It also implements trigram matching, but with a bit more of a framework around it. As expected it’s easy to use, has Unicode support, and since it ships with a trigram index you don’t even need to train it! Just for completeness, we’ve tried the same language samples as above:

<?php
require_once 'Text/LanguageDetect.php';

function detect($text, $l) {
        // Ask for the single best match; the result is keyed by language.
        $result = $l->detect($text, 1);
        if (PEAR::isError($result)) {
                return $result->getMessage();
        } else {
                return key($result);
        }
}

$l = new Text_LanguageDetect();

$italian = "
Nel mezzo del cammin di nostra vita
 mi ritrovai per una selva oscura
 ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", detect($italian, $l), "\n";

// ...the rest removed for brevity, but would be as before
?>

As expected, this outputs the same as our parser. This package can be installed straight from PEAR:

pear -d preferred_state=alpha install Text_LanguageDetect

Happy detecting!