So far when we've been looking at text we've been breaking it down into words, albeit with varying degrees of preprocessing, and using the word as our token or term. However, there is quite a lot of mileage in comparing other units of text, for example the letter n-gram, which can prove effective in a variety of applications.
An n-gram is just an n-letter sequence extracted from a document. For example, the word 'constable' breaks down into these trigrams (3-letter sequences): "con", "ons", "nst", "sta", "tab", "abl", "ble".
There are a lot of ways of extracting these, but a reasonably straightforward function is below. It extracts n-grams, defaulting to 3-grams, from any string passed in.
<?php
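// Returns every run of $n consecutive characters (bytes) in $word.
function getNgrams($word, $n = 3) {
    $ngrams = array();
    $len = strlen($word);
    for($i = 0; $i < $len; $i++) {
        // Only start emitting once $n characters are available.
        if($i > ($n - 2)) {
            $ng = '';
            // Collect the $n characters ending at position $i.
            for($j = $n-1; $j >= 0; $j--) {
                $ng .= $word[$i-$j];
            }
            $ngrams[] = $ng;
        }
    }
    return $ngrams;
}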
?>
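As a quick check of the 'constable' breakdown, here is a minimal sketch of the same extraction written as a single loop over substr. The name ngramsViaSubstr is hypothetical, chosen only so it does not clash with the function above, and like that function it works on bytes rather than multibyte characters.
<?php
// Hypothetical helper: single-loop n-gram extraction using substr().
function ngramsViaSubstr($word, $n = 3) {
    $ngrams = array();
    // Last offset at which a full n-gram still fits; negative for short words.
    $last = strlen($word) - $n;
    for ($i = 0; $i <= $last; $i++) {
        $ngrams[] = substr($word, $i, $n);
    }
    return $ngrams;
}

// Yields con, ons, nst, sta, tab, abl, ble
print_r(ngramsViaSubstr('constable'));
// A word shorter than $n yields an empty array
print_r(ngramsViaSubstr('ab'));
?>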
Looking at words on this level turns out to be a great way of detecting which language a document is written in. There are plenty of algorithms out there, using bigrams or trigrams and different similarity measures, but they all come down to the same idea: collect a statistical model of the n-grams typical of each language, then see which one a text sample most closely matches.
In our example we're going to use trigrams, and a vector space style cosine similarity. One of the questions that springs up pretty quickly is what to do with spaces. In this case we're ignoring them, generating trigrams only within each word (defined here as a continuous run of word characters, as matched by \w+). We're also ignoring words less than 3 characters long, just for a bit of consistency.
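The per-word rule can be sketched on its own. The version below is a hypothetical variant rather than the code used in this post: it assumes UTF-8 input, matches runs of Unicode letters with \pL, and splits words into characters rather than bytes, so accented letters stay intact. The name wordTrigrams is made up for the example, and lowercasing is left out to keep it short.
<?php
// Hypothetical helper: per-word character n-grams for UTF-8 text.
function wordTrigrams($text, $n = 3) {
    $trigrams = array();
    // \pL+ matches runs of letters; the u flag treats the input as UTF-8.
    // A failed or empty match simply yields no trigrams.
    if (!preg_match_all('/\pL+/u', $text, $matches)) {
        return $trigrams;
    }
    foreach ($matches[0] as $word) {
        // Split into characters, not bytes.
        $chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);
        $count = count($chars);
        // Words shorter than $n never enter the loop, so they are skipped.
        for ($i = 0; $i + $n <= $count; $i++) {
            $trigrams[] = implode('', array_slice($chars, $i, $n));
        }
    }
    return $trigrams;
}

// 'så' is too short and is dropped; yields bra, kän, änn, nna, nas
print_r(wordTrigrams('så bra, kännas'));
?>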
Only local weights are being considered: simply the term frequency of the trigram. We might be able to improve accuracy by adding a global component such as idf to make overly common trigrams less important, but to be honest the language identification is robust enough that it's arguable whether it's worth the bother for this example. If the range of languages were wider, it might be a useful improvement for getting fine-grained separation.
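For the curious, a global weight of that kind could be sketched as below. This is an illustration only, not something the detector in this post does: it assumes an index shaped as trigram => (language => count), treats each language as one "document", and uses the plain log(N / df) form of idf. The function name trigramIdf and the counts are made up.
<?php
// Hypothetical helper: idf per trigram, with each language as one "document".
function trigramIdf(array $index, $languageCount) {
    $idf = array();
    foreach ($index as $trigram => $perLanguage) {
        // Document frequency = number of languages containing the trigram.
        $idf[$trigram] = log($languageCount / count($perLanguage));
    }
    return $idf;
}

// Made-up counts, purely for illustration.
$index = array(
    'the' => array('english' => 120, 'dutch' => 4),
    'ijk' => array('dutch' => 35),
);
$idf = trigramIdf($index, 2);
// 'the' occurs in both languages, so its weight is log(2/2) = 0;
// 'ijk' occurs in one, so its weight is log(2).
print_r($idf);
?>
Note that with this plain form a trigram seen in every language gets a weight of zero and drops out of the score entirely, which may or may not be what you want.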
Below is a small class that implements the search. The two key methods are addDocument, which breaks down the document into trigrams and stores the frequencies against a language, and detect, which breaks down a document in the same way and, for each trigram present, compares its frequency with that in each trained language. Because we divide each count by the total number of trigrams, this is a dot product between two sets of relative frequencies, which gives us a score between 0 and 1. (Strictly speaking, a cosine would normalise by Euclidean length rather than by total count.) We then return the top scoring language.
<?php
class LangDetector {
    // trigram => (language => count)
    private $index = array();
    // language => total number of trigrams seen in training
    private $languages = array();

    // Adds the trigram counts of $document to the profile for $language.
    public function addDocument($document, $language) {
        if(!isset($this->languages[$language])) {
            $this->languages[$language] = 0;
        }
        $words = $this->getWords($document);
        foreach($words as $match) {
            $trigrams = $this->getNgrams($match);
            foreach($trigrams as $trigram) {
                if(!isset($this->index[$trigram])) {
                    $this->index[$trigram] = array();
                }
                if(!isset($this->index[$trigram][$language])) {
                    $this->index[$trigram][$language] = 0;
                }
                $this->index[$trigram][$language]++;
            }
            $this->languages[$language] += count($trigrams);
        }
    }

    // Returns the best matching language, or null if no trigram is known.
    public function detect($document) {
        $words = $this->getWords($document);
        $trigrams = array();
        foreach($words as $word) {
            foreach($this->getNgrams($word) as $trigram) {
                if(!isset($trigrams[$trigram])) {
                    $trigrams[$trigram] = 0;
                }
                $trigrams[$trigram]++;
            }
        }
        $total = array_sum($trigrams);
        $scores = array();
        foreach($trigrams as $trigram => $count) {
            if(!isset($this->index[$trigram])) {
                continue;
            }
            foreach($this->index[$trigram] as $language => $lCount) {
                if(!isset($scores[$language])) {
                    $scores[$language] = 0;
                }
                // Relative frequency in the language times that in the sample.
                $score = ($lCount / $this->languages[$language])
                    * ($count / $total);
                $scores[$language] += $score;
            }
        }
        arsort($scores);
        return key($scores);
    }

    // Lowercases the document and splits it into runs of word characters.
    private function getWords($document) {
        $document = strtolower($document);
        preg_match_all('/\w+/', $document, $matches);
        return $matches[0];
    }

    // Same extraction as the standalone getNgrams function above.
    private function getNgrams($match, $n = 3) {
        $ngrams = array();
        $len = strlen($match);
        for($i = 0; $i < $len; $i++) {
            if($i > ($n - 2)) {
                $ng = '';
                for($j = $n-1; $j >= 0; $j--) {
                    $ng .= $match[$i-$j];
                }
                $ngrams[] = $ng;
            }
        }
        return $ngrams;
    }
}
?>
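To see what the scoring step in detect computes, it can help to pull it out and run it on tiny hand-made counts. The sketch below is not part of the class: frequencyScore mirrors the per-language sum (each count divided by its total), while cosineScore is the textbook cosine, included only for comparison. Both names and the counts are made up for the example.
<?php
// Hypothetical helpers. Both take trigram => count arrays.

// Dot product of relative frequencies, as in the per-language sum in detect().
function frequencyScore(array $sample, array $profile) {
    $sampleTotal = array_sum($sample);
    $profileTotal = array_sum($profile);
    if ($sampleTotal == 0 || $profileTotal == 0) {
        return 0.0;
    }
    $score = 0.0;
    foreach ($sample as $trigram => $count) {
        if (isset($profile[$trigram])) {
            $score += ($count / $sampleTotal) * ($profile[$trigram] / $profileTotal);
        }
    }
    return $score;
}

// Textbook cosine: dot product of raw counts over both Euclidean lengths.
function cosineScore(array $sample, array $profile) {
    $dot = 0.0;
    $sampleSq = 0.0;
    $profileSq = 0.0;
    foreach ($sample as $trigram => $count) {
        $sampleSq += $count * $count;
        if (isset($profile[$trigram])) {
            $dot += $count * $profile[$trigram];
        }
    }
    foreach ($profile as $count) {
        $profileSq += $count * $count;
    }
    if ($sampleSq == 0 || $profileSq == 0) {
        return 0.0;
    }
    return $dot / sqrt($sampleSq * $profileSq);
}

// Made-up counts, purely for illustration.
$sample = array('the' => 2, 'her' => 1, 'ere' => 1);
$profile = array('the' => 3, 'her' => 1);
echo frequencyScore($sample, $profile), "\n"; // (2/4)*(3/4) + (1/4)*(1/4) = 0.4375
echo cosineScore($sample, $profile), "\n";    // 7 / sqrt(6 * 10), roughly 0.90
?>
Both stay between 0 and 1, but only the cosine reaches 1 when the two sets of counts are identical.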
I didn't have a great selection of training material on my laptop, but I did have the multilingual OS X dictionaries, which provided a pretty decent base. A bit of strip_tags to remove the XML cleaned them up, then the class above does the rest.
<?php
$lang = new LangDetector();
$dir = "/Library/Dictionaries/Apple Dictionary.dictionary/Contents/Resources/";
$dutch = strip_tags(file_get_contents($dir . "Dutch.lproj/Body.data"));
$lang->addDocument($dutch, 'dutch');
$english = strip_tags(file_get_contents($dir . "English.lproj/Body.data"));
$lang->addDocument($english, 'english');
$finnish = strip_tags(file_get_contents($dir . "fi.lproj/Body.data"));
$lang->addDocument($finnish, 'finnish');
$spanish = strip_tags(file_get_contents($dir . "Spanish.lproj/Body.data"));
$lang->addDocument($spanish, 'spanish');
$italian = strip_tags(file_get_contents($dir . "Italian.lproj/Body.data"));
$lang->addDocument($italian, 'italian');
$french = strip_tags(file_get_contents($dir . "French.lproj/Body.data"));
$lang->addDocument($french, 'french');
$swedish = strip_tags(file_get_contents($dir . "sv.lproj/Body.data"));
$lang->addDocument($swedish, 'swedish');
?>
With our index built we can then test with various languages to see what kind of results we get. A big thank you to Lorenzo, Soila (who speaks a whole lotta languages) and Ivo for the samples below:
<?php
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", $lang->detect($italian), "\n";
$finnish = "
Suomalainen on sellainen, joka vastaa kun ei kysytä,
kysyy kun ei vastata, ei vastaa kun kysytään,
sellainen, joka eksyy tieltä, huutaa rannalla
ja vastarannalla huutaa toinen samanlainen.
";
echo $finnish, "\n", "is ", $lang->detect($finnish), "\n";
$dutch = "
zoals het klokje thuis tikt, tikt het nergens
";
echo $dutch, "\n", "is ", $lang->detect($dutch), "\n";
$spanish = "
Por qué los inmensos aviones
No se pasean con sus hijos?
Cuál es el pájaro amarillo
Que llena el nido de limones?
Por qué no enseñan a sacar
Miel del sol a los helicópteros?
";
echo $spanish, "\n", "is ", $lang->detect($spanish), "\n";
$swedish = "
Och knyttet tog av skorna och suckade och sa:
hur kan det kännas sorgesamt fast allting är så bra?
Men vem ska trösta knyttet med att säga: lilla vän,
vad gör man med en snäcka om man ej får visa den?
";
echo $swedish, "\n", "is ", $lang->detect($swedish), "\n";
?>
As you can see in the (trimmed) output below, each language is properly detected.
Nel mezzo del cammin... is italian
Suomalainen on sellainen... is finnish
zoals het klokje thuis tikt, tikt het nergens is dutch
Por qué los inmensos... is spanish
Och knyttet tog av... is swedish
The same thing should even work with whole websites, and to test we can use strip_tags again to remove the HTML. I tried it on the websites of the three Ibuildings local offices:
<?php
$nl = strip_tags(file_get_contents('http://www.ibuildings.nl'));
echo "IB NL reads as: " . $lang->detect($nl), "\n";
$uk = strip_tags(file_get_contents('http://www.ibuildings.co.uk'));
echo "IB UK reads as: " . $lang->detect($uk), "\n";
$it = strip_tags(file_get_contents('http://www.ibuildings.it'));
echo "IB IT reads as: " . $lang->detect($it), "\n";
?>
It seems there's more English than Dutch text on the NL homepage, so it registers as English, but the Italian homepage is picked up correctly.
IB NL reads as: english
IB UK reads as: english
IB IT reads as: italian
While the trigram method is a neat fix, it isn't necessarily the best technique to use in every situation. For a method that requires no training, consider simply compiling a list of stop words (extremely common words, such as 'the' and 'a' in English) for each language and looking for them in the text. This can give you effective language detection at low cost, and with a small data overhead.
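A minimal sketch of the stop word idea follows. The function name detectByStopWords and the word lists are hypothetical and far too short for real use; the lists are kept to plain ASCII so that strtolower is enough, and the input is assumed to be UTF-8.
<?php
// Hypothetical helper: pick the language whose stop words occur most often.
function detectByStopWords($text, array $stopWords) {
    // \pL+ matches runs of letters; the u flag treats the input as UTF-8.
    if (!preg_match_all('/\pL+/u', strtolower($text), $matches)) {
        return null;
    }
    $words = array_count_values($matches[0]);
    $scores = array();
    foreach ($stopWords as $language => $list) {
        $scores[$language] = 0;
        foreach ($list as $stopWord) {
            if (isset($words[$stopWord])) {
                $scores[$language] += $words[$stopWord];
            }
        }
    }
    arsort($scores);
    // reset() returns the top score; null if there are no hits at all.
    return reset($scores) > 0 ? key($scores) : null;
}

// Deliberately tiny, made-up lists; real ones would be much longer.
$stopWords = array(
    'english' => array('the', 'and', 'of', 'to', 'in', 'is'),
    'dutch'   => array('de', 'het', 'een', 'en', 'van', 'is'),
    'italian' => array('il', 'la', 'di', 'che', 'per', 'una'),
);
echo detectByStopWords('zoals het klokje thuis tikt, tikt het nergens', $stopWords), "\n"; // dutch
?>
Very short samples may contain no stop words at all, in which case this returns null, so it probably works best as a first pass before falling back to trigrams.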
Similarly, looking for Unicode code points that only feature in certain languages can give you excellent accuracy, as long as the code points do actually appear. There can be some problems around tokenisation with some non-Unicode code pages, which would certainly scupper the trigram method, but once you have Unicode it becomes easier to handle certain issues, and it offers more detection possibilities.
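One coarse version of this is to look at whole Unicode scripts rather than individual code points. The sketch below assumes UTF-8 input and uses PCRE's script properties; the function name detectByScript and the script-to-language mapping are hypothetical, and a script only points to a single language when no other candidate language uses it.
<?php
// Hypothetical helper: guess a language from the dominant Unicode script.
function detectByScript($text) {
    // Made-up mapping; extend or change to suit the candidate languages.
    $scripts = array(
        'Greek'  => 'greek',
        'Hangul' => 'korean',
        'Thai'   => 'thai',
        'Hebrew' => 'hebrew',
    );
    $best = null;
    $bestCount = 0;
    foreach ($scripts as $script => $language) {
        // Counts the characters belonging to the script; the u flag means UTF-8.
        $count = preg_match_all('/\p{' . $script . '}/u', $text, $matches);
        if ($count > $bestCount) {
            $best = $language;
            $bestCount = $count;
        }
    }
    // null when none of the listed scripts appears in the text.
    return $best;
}

echo detectByScript('Καλημέρα κόσμε'), "\n"; // greek
?>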
When discussing this problem with Lorenzo, he mentioned that there is already a language detection PEAR package, though it is still in alpha. It also implements trigram matching, but with a bit more of a framework around it. As expected it's easy to use and has Unicode support, and since it ships with a trigram index you don't even need to train it! Just for completeness, we've tried the same language samples as above:
<?php
require_once 'Text/LanguageDetect.php';
function detect($text, $l) {
    $result = $l->detect($text, 1);
    if (PEAR::isError($result)) {
        return $result->getMessage();
    } else {
        return key($result);
    }
}
$l = new Text_LanguageDetect();
$italian = "
Nel mezzo del cammin di nostra vita
mi ritrovai per una selva oscura
ché la diritta via era smarrita.
";
echo $italian, "\n", "is ", detect($italian, $l), "\n";
// ...the rest removed for brevity, but would be as before
?>
As expected, this outputs the same as our detector. This package can be installed straight from PEAR:
pear -d preferred_state=alpha install Text_LanguageDetect
Happy detecting!
Bastian
November 11th, 2009 at 11:23
Very nice article. I've played around with a similar topic by analyzing text to extract keywords and text-clustering. This language detection could be of some help with text in an unknown language, to avoid mixing up languages in keywords.
Wahid Afghan
November 11th, 2009 at 12:26
You have to know some words of a language before you can detect it, right? That is not suitable for real-life use.
I use the API of http://langid.net/ and it works pretty well.
Tomek
November 11th, 2009 at 14:02
I believe in the function getNgrams() first parameter should be $match, not $word
typografia
November 11th, 2009 at 15:24
It works surprisingly well even for small portions of sample data.
But a lot of benchmarks and optimizations have to be done.
The key is to determine the order in which the testing should be executed.
The program may try to detect stop words, or language specific characters first, prior to using ngrams.
Also using regular expressions isn't the best solution for tasks as simple as extracting words.
BTW: There's a typo in the code: $scores[$language] == 0;
System
November 14th, 2009 at 16:06
Complicated but works great.
Alex Snet
November 15th, 2009 at 10:25
This is great!
As I understand it, we can use any type of dictionary, not just the Apple one?
Can we use a simple word list with \n delimiter?
Ian Barber
December 1st, 2009 at 23:19
Wahid - You do have to learn from some text, but that'll be true of any system, langid will have had some training data somewhere. You only have to do it once though!
Alex - yep, any largish chunk of text in each language would do. Wikipedia is a good source, but just about anything of decent size would work.
typografia - that's a good point, the order of the tests would definitely be worth investigating for a production system, could have a big effect on accuracy. Thanks for spotting the typo too!
Kate
December 17th, 2009 at 08:52
Nice article.
The code could use at least basic optimization; e.g. in the first function:
for($i = 0; $i < strlen($match); $i++) {
if($i > ($n - 2)) {
can be replaced with:
$len = strlen($match);
for($i = $n-1; $i < $len; $i++) {
No need to calculate strlen each time the condition is checked, no need for 'if' inside. The function also requires only one loop actually (it would probably be faster that way).
And so on.
Ian Barber
December 17th, 2009 at 09:08
Good suggestions Kate, that's a nice way of cutting out the if!
Peter
July 18th, 2012 at 04:52
How did you handle the multibyte strings?
When I use getNgrams my array contains [94] => ��s [95] => �s [96] => s s
Ian Barber
July 18th, 2012 at 20:10
Well, the output will need to be in the proper character encoding, and you may run into difficulties with what defines a word boundary and so on depending on the language. Overall though, it doesn't really matter - we're not treating the strings like characters, just as "things" to be counted and ranked.
Nick Snels
February 26th, 2013 at 19:46
I have written a web service that can identify 100+ languages. It can be used by any programming language, including PHP. It can take texts, websites and files as input and it outputs the identified language as an XML or JSON object. You can test it at http://www.whatlanguage.net