Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be hugely beneficial, and the first step in such parsing is often part-of-speech tagging.
The tags in question are the grammatical parts of speech that the words fall into: the traditional noun, verb, adjective and so on that hopefully most people will dimly remember. Being able to tag a document appropriately is hugely helpful in trying to extract what the document is discussing, and in determining other aspects of the text that are self-evident for a human reader, but tricky to determine statistically, particularly with a small number of examples.
The parts of speech are somewhat difficult to work out completely automatically, and even humans can get stuck with words that have many possible interpretations. Almost every system around utilises a corpus, a set of documents that have their words hand tagged (or hand verified) for parts of speech. This can then be used to extract statistics and build taggers. Because there are many more parts of speech than may come to mind, various codes are used to tag the files; a full list for the common Brown corpus is available on Wikipedia. Some examples are NN for noun, NNS for plural noun, VB for verb and VBD for verb past tense, and a tagged string might look like this:
The/DT quick/JJ brown/JJ fox/NN jumped/VBD over/IN the/DT lazy/JJ dog/NN
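A tagged string in this word/TAG form is easy to split back into pairs, which is handy for inspecting or comparing tagged text. The following is a minimal sketch: `parseTagged` is a name made up for illustration, and it assumes tokens are separated by whitespace and that the tag follows the last slash in each token.

```php
<?php
// Hypothetical helper: split "word/TAG word/TAG ..." into token/tag pairs.
// Assumes whitespace-separated items with the tag after the LAST slash,
// so an item such as "1/2/CD" keeps the slash inside its token.
function parseTagged($line) {
    $pairs = array();
    foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $item) {
        $pos = strrpos($item, '/');
        if ($pos === false) {
            continue; // no tag attached, skip it
        }
        $pairs[] = array(
            'token' => substr($item, 0, $pos),
            'tag'   => substr($item, $pos + 1),
        );
    }
    return $pairs;
}

$pairs = parseTagged('The/DT quick/JJ brown/JJ fox/NN jumped/VBD');
echo $pairs[3]['token'] . ' is tagged ' . $pairs[3]['tag'] . "\n";
```

Each pair is an array with 'token' and 'tag' keys, the same shape the tagger class in this post returns.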
For our implementation, we'll look at a tagger that is relatively simple to write, invented by Eric Brill in the early nineties. The tagger was trained by analysing a corpus and noting the frequencies of the different tags for a given word. As words were tagged they were assigned the most frequent tag for the word if it was in the corpus, or tagged as a noun if not. Then a series of transformations was applied, which changed the tag depending on various conditions. The results were compared to known correct tags, and the rules that added the most accuracy were retained.
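To make that training loop a little more concrete, here is a rough sketch of how one candidate transformation could be scored against the known correct tags. This is an illustration rather than Brill's actual code: `applyRule` and `scoreRule` are hypothetical names, and the rule shape used here (change tag A to tag B when the previous tag is C) is just one of the simpler forms such a rule can take.

```php
<?php
// Apply "change $from to $to when the previous tag is $prev" to a tag list.
function applyRule(array $tags, $from, $to, $prev) {
    $out = $tags;
    for ($i = 1; $i < count($tags); $i++) {
        // Conditions are checked against the tags as they were before
        // this rule ran (a simplifying assumption for the sketch)
        if ($tags[$i] == $from && $tags[$i - 1] == $prev) {
            $out[$i] = $to;
        }
    }
    return $out;
}

// Net gain: correct tags after applying the rule minus correct tags before.
function scoreRule(array $current, array $gold, $from, $to, $prev) {
    $changed = applyRule($current, $from, $to, $prev);
    $before = 0;
    $after = 0;
    foreach ($gold as $i => $tag) {
        if ($current[$i] == $tag) {
            $before++;
        }
        if ($changed[$i] == $tag) {
            $after++;
        }
    }
    return $after - $before;
}

// Initial guess DT VB VBD, where the hand-tagged answer is DT NN VBD
$current = array('DT', 'VB', 'VBD');
$gold    = array('DT', 'NN', 'VBD');
echo scoreRule($current, $gold, 'VB', 'NN', 'DT') . "\n"; // prints 1
```

A trainer along these lines would try many candidate rules, keep the one with the best score, apply it, and repeat until no rule improves things much.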
Luckily for us, we can just use the most successful rules, and don't have to reimplement the whole thing. The code here draws from the (many!) implementations of the Brill tagger by Mark Watson in various languages. The rules are pretty straightforward, such as making a word a past participle if it ends with 'ed', or an adverb if it ends with 'ly'.
<?php
class PosTagger {
    private $dict = array();

    public function __construct($lexicon) {
        $fh = fopen($lexicon, 'r');
        while(($line = fgets($fh)) !== false) {
            // trim the line ending so the last tag doesn't carry a newline
            $tags = explode(' ', trim($line));
            $this->dict[strtolower(array_shift($tags))] = $tags;
        }
        fclose($fh);
    }

    public function tag($text) {
        preg_match_all("/[\w\d\.]+/", $text, $matches);
        $nouns = array('NN', 'NNS');

        $return = array();
        $i = 0;
        foreach($matches[0] as $token) {
            // default to a common noun
            $return[$i] = array('token' => $token, 'tag' => 'NN');

            // remove trailing full stops
            if(substr($token, -1) == '.') {
                $token = preg_replace('/\.+$/', '', $token);
            }

            // get from dict if set
            if(isset($this->dict[strtolower($token)])) {
                $return[$i]['tag'] = $this->dict[strtolower($token)][0];
            }

            // Converts verbs after 'the' to nouns
            if($i > 0) {
                if($return[$i - 1]['tag'] == 'DT' &&
                    in_array($return[$i]['tag'],
                        array('VBD', 'VBP', 'VB'))) {
                    $return[$i]['tag'] = 'NN';
                }
            }

            // Convert noun to number if . appears
            if($return[$i]['tag'][0] == 'N' && strpos($token, '.') !== false) {
                $return[$i]['tag'] = 'CD';
            }

            // Convert noun to past participle if ends with 'ed'
            if($return[$i]['tag'][0] == 'N' && substr($token, -2) == 'ed') {
                $return[$i]['tag'] = 'VBN';
            }

            // Anything that ends 'ly' is an adverb
            if(substr($token, -2) == 'ly') {
                $return[$i]['tag'] = 'RB';
            }

            // Common noun to adjective if it ends with al
            if(in_array($return[$i]['tag'], $nouns)
                && substr($token, -2) == 'al') {
                $return[$i]['tag'] = 'JJ';
            }

            // Noun to verb if the word before is 'would'
            if($i > 0) {
                if($return[$i]['tag'] == 'NN'
                    && strtolower($return[$i-1]['token']) == 'would') {
                    $return[$i]['tag'] = 'VB';
                }
            }

            // Convert noun to plural if it ends with an s
            if($return[$i]['tag'] == 'NN' && substr($token, -1) == 's') {
                $return[$i]['tag'] = 'NNS';
            }

            // Convert common noun to gerund
            if(in_array($return[$i]['tag'], $nouns)
                && substr($token, -3) == 'ing') {
                $return[$i]['tag'] = 'VBG';
            }

            // If we get noun noun, and the second can be a verb, convert to verb
            if($i > 0) {
                if(in_array($return[$i]['tag'], $nouns)
                    && in_array($return[$i-1]['tag'], $nouns)
                    && isset($this->dict[strtolower($token)])) {
                    if(in_array('VBN', $this->dict[strtolower($token)])) {
                        $return[$i]['tag'] = 'VBN';
                    } else if(in_array('VBZ',
                        $this->dict[strtolower($token)])) {
                        $return[$i]['tag'] = 'VBZ';
                    }
                }
            }

            $i++;
        }

        return $return;
    }
}
?>
The lexicon for the class is available, or could be extracted with some work from the Brown corpus itself. There are bigger corpora available, which could give better results, but at the cost of more processing, and more overhead.
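As a rough idea of what that extraction might involve, the sketch below counts how often each tag appears for each word and writes out one line per word with the most frequent tag first, which is the layout the constructor above reads. `buildLexicon` is a hypothetical name, and the sketch assumes the corpus has already been converted to word/TAG text using the same tag names as the lexicon, which real corpus files may not match without some cleaning.

```php
<?php
// Hypothetical helper: build lexicon lines ("word TAG1 TAG2 ...") from
// hand-tagged text in word/TAG form, most frequent tag first.
function buildLexicon(array $taggedLines) {
    $counts = array();
    foreach ($taggedLines as $line) {
        foreach (preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY) as $item) {
            $pos = strrpos($item, '/');
            if ($pos === false) {
                continue; // untagged item, ignore
            }
            $word = strtolower(substr($item, 0, $pos));
            $tag  = substr($item, $pos + 1);
            if (!isset($counts[$word][$tag])) {
                $counts[$word][$tag] = 0;
            }
            $counts[$word][$tag]++;
        }
    }

    $lines = array();
    foreach ($counts as $word => $tags) {
        arsort($tags); // highest count first
        $lines[] = $word . ' ' . implode(' ', array_keys($tags));
    }
    return $lines;
}

$lines = buildLexicon(array(
    'the/DT run/NN was/VBD long/JJ',
    'they/PRP run/VBP',
    'a/DT good/JJ run/NN',
));
echo implode("\n", $lines) . "\n";
```

In this toy input 'run' is a noun twice and a verb once, so its line comes out as `run NN VBP`.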
<?php
// little helper function to print the results
function printTag($tags) {
    foreach($tags as $t) {
        echo $t['token'] . "/" . $t['tag'] . " ";
    }
    echo "\n";
}

$tagger = new PosTagger('lexicon.txt');
$tags = $tagger->tag('The quick brown fox jumped over the lazy dog');
printTag($tags);
?>
With the quick brown fox example we got perfect tagging (see the example up above), but for a tougher test we can try this with the grammatical powerhouse that is Twitter. While we might not get perfect results, hopefully we should get something in the ballpark, and to keep it interesting we can take a look at the nouns that are tagged to see how they fit the message. Thanks to Sam, Helgi and Johanna for their tweets.
<?php
// @samsoir
$tags = $tagger->tag("Coffee... yes I've said it already today, but it really does keep ones mind fresh and aler [zzzzzzzzZZZZZZZ]");
printTag($tags);
// @h
$tags = $tagger->tag("How can I make twitter not think that @h&m is not a mention to / about me! Gah. I have had enough of these Jimmy Choo and wtf ever things.");
printTag($tags);
// @johannacherry
$tags = $tagger->tag("i think my brain has checked out for the day..i've been playing with my hair and thinking about toothpaste for about 10 minutes now...");
printTag($tags);
?>
Output:
Coffee.../NN yes/UH I/NN ve/NN said/VBD it/PRP already/RB today/NN but/CC it/PRP really/RB does/VBZ keep/VB ones/NNS mind/NN fresh/JJ and/CC aler/NN zzzzzzzzZZZZZZZ/NN
Noun wise, this has picked up Coffee, Today, Ones, Mind and zzzzzzz, which does sum up the message pretty nicely. Notice that the typo of 'alert' is mistagged, as is "I've", which suffers from the simplicity of the tokeniser.
How/WRB can/MD I/NN make/VB twitter/NN not/RB think/VBP that/IN h/NN m/NN is/VBZ not/RB a/DT mention/NN to/TO about/IN me/PRP Gah./NN I/NN have/VBP had/VBD enough/RB of/IN these/DT Jimmy/NNP Choo/NN and/CC wtf/NN ever/RB things./NNS
Again on the nouns we have: I, twitter, h, m, mention, Gah, I, Jimmy, Choo, wtf, things. Again an extension to the tokeniser could help here, and an addition to the lexicon to get wtf marked as UH (an interjection or exclamation).
i/NN think/VBP my/PRP$ brain/NN has/VBZ checked/VBN out/IN for/IN the/DT day.. i/CD ve/NN been/VBN playing/VBG with/IN my/PRP$ hair/NN and/CC thinking/VBG about/IN toothpaste/NN for/IN about/IN 10/NN minutes/NNS now.../RB
Again we can see some tokenisation driven errors, but brain, hair, 10 and minutes pop out, which isn't too bad.
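Since several of these errors come from the tokeniser, here is one possible tweak, as a sketch only. `tokenise` is a hypothetical function that could stand in for the `preg_match_all` call in `tag()`: it keeps contractions such as "i've" in one piece, keeps decimal numbers together, and no longer glues words across runs of full stops. It only handles plain ASCII letters and digits, and whether the contractions then get a sensible tag depends on how the lexicon stores them.

```php
<?php
// Hypothetical tokeniser: either a number with an optional decimal part,
// or a run of letters that may contain internal apostrophes.
function tokenise($text) {
    preg_match_all("/\d+(?:\.\d+)?|[A-Za-z]+(?:'[A-Za-z]+)*/", $text, $matches);
    return $matches[0];
}

print_r(tokenise("the day..i've been thinking for about 10 minutes now..."));
```

With this, "day..i've" splits into "day" and "i've", and "now..." comes through as just "now". One side effect is that the rule which turns nouns containing a full stop into numbers would only ever fire for real decimals.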
There are taggers that work quite differently, for example extracting language models (Hidden Markov Models) that model the probabilities extracted from the corpus, but given the amount of code for the results, I think the Brill tagger is a pretty nice option! There's much that could be done to tidy up this one, but particularly for long texts there is enough data to do some useful entity extraction and further processing.
Dave
November 20th, 2009 at 13:56
I remember doing some stuff like this back in Uni. Great to see something like this come up on planetphp and really interesting idea to apply it to twitter, which is presumably a pretty tough test.
Hasin Hayder
November 25th, 2009 at 14:21
Thanks a lot for this. I have been working on a translator project and recently found watson's work on that.
Thanks for this excellent conversion. Much appreciated. You saved my day.
Text Generation - PHP/ir
December 9th, 2009 at 10:00
...s - we'd want a larger body of text, but it might lend something to the process. We'll be using the PoS tagger from an earlier post, but because there are a couple of minor modifications it's included in this ...
Ed Parsons
March 30th, 2010 at 20:48
Hi PHP/ir I'm trying to implement this code and I get an error saying php has run out of space in memory. Now when I echo on each add of a word to the dict the script finishes but only "in" and "at" are found in the dict nothing else. Any ideas? Can I just increase the memory and any idea how big it needs to be?
Ian Barber
March 30th, 2010 at 22:46
Hi Ed. Yeah, it can be a bit memory hungry. What's your memory_limit set to currently? I'm not sure why the echo would have an effect, other than slowing it down though. I was running with a memory_limit of 132M if I recall correctly.
There's definitely room for improvement in the memory usage of this though, particularly if you want to build something on top of it.
Ed Parsons
March 31st, 2010 at 18:11
haha, figured it I was using array_shift twice hence why with the echo it worked, I've now looked more closely and found although the text file is 1.2M by the second round of a to z it bums out at c for me, which is about 12mb, do you know of a command in php where you can on the fly up the amount of memory allowed. I know there is something for pushing back timeouts.
You talked about improving the memory, for this how should I go about doing it, it doesn't matter too much about time although opening the file and reading it every time, probably wouldn't be too good an idea.
cheers Ed
Ian Barber
March 31st, 2010 at 21:33
It's an ini setting, so you can do php -d memory_limit=128M, or php_value memory_limit 128M , or with ini_set.
I would definitely cache the result, as you suggest. The array could be somewhat optimised, maybe keeping the tags in a string rather than a subarray. You could also save it out separately, and only loading the chunks needed. Probably generating the array out and just saving it out then restoring it with the larger memory limit will get you somewhere reasonable.
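As a rough sketch of the caching idea (not tested against the full lexicon), the function below parses the lexicon file once, keeps the tags for each word as a single string instead of a sub-array, and writes a serialized copy that later runs can load directly. `loadLexicon` and the cache file name are made up for illustration.

```php
<?php
// Hypothetical helper: parse the lexicon once, then reuse a serialized copy.
// Tags are kept as one string per word ("NN VB") rather than a sub-array.
function loadLexicon($lexiconFile, $cacheFile) {
    // Use the cache if it exists and is at least as new as the lexicon
    if (is_readable($cacheFile) && filemtime($cacheFile) >= filemtime($lexiconFile)) {
        return unserialize(file_get_contents($cacheFile));
    }

    $dict = array();
    $fh = fopen($lexiconFile, 'r');
    while (($line = fgets($fh)) !== false) {
        // Split into the word and the rest of the line (the tags)
        $parts = explode(' ', trim($line), 2);
        if (count($parts) == 2) {
            $dict[strtolower($parts[0])] = $parts[1];
        }
    }
    fclose($fh);

    file_put_contents($cacheFile, serialize($dict));
    return $dict;
}

// Raise the limit at runtime, if the host allows it
ini_set('memory_limit', '128M');
```

With the tags stored this way, the tagger would need something like `explode(' ', $dict[$word])` at lookup time to get the first tag or to check for a particular one.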
Mary Nicole Hicks
April 7th, 2010 at 22:14
You said that "Again we can see some tokenisation driven errors, but brain, hair, 10 and minutes pop out, which isn't too bad"
Why not use preg_match_all("/([\w\d]+(\p{Po}{0,1}[\w\d]+)?)+/", $tags, $matches);
Slower, but should be more accurate?
What do you think?
cynthiamyint
August 20th, 2010 at 20:22
I want to know POS tagger
David
December 18th, 2010 at 05:48
Hi Ian!
I think your post is very good. I've been looking for something like this for a while. I do see one major problem with this kind of "naive" tagging: It doesn't detect senses.
Sense detection is quite important when tagging because words can have different meanings (ergo grammatical functions) depending on the context of the word.
Are you aware of any important POS tagger written in PHP?
Cheers,
David
Ian Barber
December 27th, 2010 at 20:59
Hi David, it's absolutely true, sense detection and disambiguation is important for a lot of applications. I'm not aware of any particular php POS taggers - it's something that is probably easiest to hand off to one of the established packages, but if you do find a good PHP implementation do let me know!
Richard
February 21st, 2011 at 00:44
Hi Ian,
First of all many thanks for creating a PHP script to obey the rules of Brill tagging. I first started looking at PoS tagging a few years back and it always captures the imagination.
Regarding 'word sense', Wordnet may be worth a look at due to its unique layout and 'sense' information.
My main reason for posting however is a heads up to anyone who wants it to run a little faster. It takes around 0.5 seconds to load the dictionary from file, I tried a (very quickly made) alternative using a MySQL MEMORY table.
The structure:
CREATE TABLE IF NOT EXISTS `lexicon` (
`lemma` varchar(255) NOT NULL,
`tags` varchar(255) NOT NULL,
UNIQUE KEY `lemma` (`lemma`)
) ENGINE=MEMORY
Getting data into the database:
LOAD DATA INFILE '/home/richard/lexicon.txt' IGNORE INTO TABLE lexicon FIELDS TERMINATED BY '\t';
This made it 10x faster, (0.05s vs 0.5s) on a couple of quick checks. Memory usage was slightly more at 40MB versus 30MB. This was by removing the __construct function and replacing the following
if(isset($this->dict[strtolower($token)])) {
$return[$i]['tag'] = $this->dict[strtolower($token)][0];
}
with
if($row = mysql_fetch_array(mysql_query('SELECT tags FROM lexicon WHERE lemma = \''.mysql_real_escape_string($token).'\''),MYSQL_ASSOC)) {
$return[$i]['tag'] = $row['tags'];
}
Of course, it would likely be faster if queries were sent a sentence at a time or blocks of words.
Also, perhaps changing
preg_match_all("/[\w\d\.\-]+/", $text, $matches);
...to read paragraphs may be of speed benefit.
Again, thanks very much for producing your tagging class.
Cecio Cecioni
July 11th, 2011 at 16:12
I found a little bug: the word "Italy" is categorized as an adverb. One possible correction to the code is:
BEFORE:
// Anything that ends 'ly' is an adverb
if( substr($token, -2) == 'ly' ) {
$return[$i]['tag'] = 'RB';
}
AFTER:
// Anything that ends 'ly' is an adverb except the word "Italy"
if( substr($token, -2) == 'ly' and $token !== 'Italy') {
$return[$i]['tag'] = 'RB';
}
chaithanya
January 28th, 2012 at 15:56
HI..I am working on implementing a Parts of speech tagger for Telugu language using statistical based approach(TnT)...can u please suggest me a way for doing that..
Marilyn
December 11th, 2012 at 16:20
Hi Ian, is there any coding and tutorial in php where I can calculate the percentage of part-of-speech tags that have been correctly assigned to evaluate the performance of pos tagger? Thanks for your help.
Ian Barber
December 22nd, 2012 at 00:47
Marilyn: Nothing specific, but it's just a straightforward count of the matches between a manually tagged document and the automatically tagged one.
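Something along these lines would do it, as a minimal sketch. `tagAccuracy` is a hypothetical name; both arguments are arrays in the same shape the tagger returns, and it assumes both versions were tokenised the same way so that the positions line up.

```php
<?php
// Hypothetical helper: percentage of positions where the automatic tag
// matches the manual one. Each argument is a list of arrays with
// 'token' and 'tag' keys, compared position by position.
function tagAccuracy(array $predicted, array $gold) {
    $total = min(count($predicted), count($gold));
    if ($total == 0) {
        return 0.0;
    }
    $correct = 0;
    for ($i = 0; $i < $total; $i++) {
        if ($predicted[$i]['tag'] === $gold[$i]['tag']) {
            $correct++;
        }
    }
    return 100.0 * $correct / $total;
}

$gold = array(
    array('token' => 'the',  'tag' => 'DT'),
    array('token' => 'dog',  'tag' => 'NN'),
    array('token' => 'ran',  'tag' => 'VBD'),
    array('token' => 'home', 'tag' => 'NN'),
);
$predicted = $gold;
$predicted[2]['tag'] = 'NN'; // one deliberate mistake
echo tagAccuracy($predicted, $gold) . "%\n"; // prints 75%
```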
Bala Menon
February 16th, 2013 at 18:01
What a life saver. I wanted for my application to tag POS in a sentence and do some processing afterwards and using the NLTK library proved utterly painful just for POS.
Thanks so much for this! Works like a charm.
K. H.
April 3rd, 2013 at 20:04
Many thanks! I was looking for an online API to do some POS tagging for a project, but I couldn't find anything that would do the job. I came across this article and decided to give it a try -- and it works quite well!