Until now, all the posts here have looked at text in a purely statistical way. What the words actually were was less important than how common they were, and whether they occurred in a query or a category. There are plenty of applications, however, where a deeper parsing of the text could be hugely beneficial, and the first step in such parsing is often part of speech tagging.
The tags in question are the grammatical parts of speech that the words fall into: the traditional noun, verb, adjective and so on that hopefully most people will dimly remember. Being able to tag a document appropriately is hugely helpful in trying to extract what a document is discussing, and in determining other aspects of the text that are self-evident to a human reader but tricky to determine statistically, particularly with a small number of examples.
The parts of speech are somewhat difficult to work out completely automatically, and even humans can get stuck on words that have many possible interpretations. Almost every system around utilises a corpus: a set of documents that have had their words hand tagged (or hand verified) for parts of speech. This can then be used to extract statistics and build taggers. Because there are many more parts of speech than may come to mind, there are various codes used to tag the files; a full list for the common Brown corpus is available on Wikipedia. Some examples are NN for noun, NNS for plural noun, VB for verb, and VBD for verb past tense, and a tagged string might look like this: The/AT quick/JJ brown/JJ fox/NN jumps/VBZ over/IN the/AT lazy/JJ dog/NN
For our implementation, we'll look at a relatively simple-to-write tagger invented by Eric Brill in the early nineties. The tagger was trained by analysing a corpus and noting the frequencies of the different tags for each word. As words were tagged, each was assigned the most frequent tag for that word if it appeared in the corpus, or tagged as a noun if not. Then a series of transformations was applied, changing the tag depending on various conditions. The results were compared to known correct tags, and the rules that added the most accuracy were retained.
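As a rough sketch of that baseline step (not the actual implementation, and using a tiny illustrative lexicon rather than the real one), the first pass just looks each word up and falls back to noun:

```python
# Minimal baseline tagger sketch: assign each word its most frequent
# tag from a (tiny, illustrative) lexicon, defaulting to NN (noun)
# for unknown words. Tag codes follow the Brown conventions.
LEXICON = {
    "the": "AT", "quick": "JJ", "brown": "JJ", "fox": "NN",
    "jumps": "VBZ", "over": "IN", "lazy": "JJ", "dog": "NN",
}

def baseline_tag(words):
    """Return (word, tag) pairs using lexicon lookup, NN as fallback."""
    return [(w, LEXICON.get(w.lower(), "NN")) for w in words]

print(baseline_tag("The quick brown fox jumps over the lazy dog".split()))
```

Unknown words like typos or made-up tokens will simply come out as NN here, which is why the transformation rules in the next step matter.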
Luckily for us, we can just use the most successful rules, and don't have to reimplement the whole training process. The code here draws from the (many!) implementations of the Brill tagger by Mark Watson in various languages. The rules are pretty straightforward, such as making a word a past participle if it ends with 'ed', or an adverb if it ends with 'ly'.
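A couple of those rules can be sketched as simple string checks applied after the baseline pass. The rules and tag codes below are illustrative examples in the spirit of the Brill rule set, not the full list:

```python
def apply_rules(tagged):
    """Apply a few illustrative Brill-style transformation rules.

    Simplified examples only: a word the baseline called a noun becomes
    a past participle (VBN) if it ends in 'ed', and any word ending in
    'ly' becomes an adverb (RB).
    """
    out = []
    for word, tag in tagged:
        if tag.startswith("NN") and word.endswith("ed"):
            tag = "VBN"   # e.g. 'walked' -> past participle
        elif word.endswith("ly"):
            tag = "RB"    # e.g. 'quickly' -> adverb
        out.append((word, tag))
    return out

print(apply_rules([("walked", "NN"), ("quickly", "NN"), ("dog", "NN")]))
# walked -> VBN, quickly -> RB, dog stays NN
```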
The lexicon for the class is available, or could be extracted with some work from the Brown corpus itself. There are bigger corpora available, which could give better results, but at the cost of more processing and more overhead.
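Assuming the lexicon is a plain text file with one word per line followed by its possible tags, most frequent first (a common layout for Brill lexicons, though the exact file format here is an assumption worth checking), loading it might look like:

```python
def load_lexicon(path):
    """Load a lexicon file where each line is: word TAG1 TAG2 ...

    Tags are assumed to be listed most-frequent first, so we keep only
    the first one. Returns a dict mapping lowercased words to that tag.
    NOTE: the file format is assumed, not taken from the post.
    """
    lexicon = {}
    with open(path) as fh:
        for line in fh:
            parts = line.split()
            if len(parts) >= 2:
                lexicon[parts[0].lower()] = parts[1]
    return lexicon
```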
With the quick brown fox example we got perfect tagging (see the example up above), but for a tougher test we can try the grammatical powerhouse that is Twitter. While we might not get perfect results, hopefully we'll get something in the ballpark, and to keep it interesting we can take a look at the nouns that are tagged to see how they fit the message. Thanks to Sam, Helgi and Johanna for their tweets.
Noun-wise, this has picked up Coffee, Today, Ones, Mind and zzzzzzz, which sums up the message pretty nicely. Notice that the typo of 'alert' is mistagged, as is "I've", which suffers from the simplicity of the tokeniser.
Again on the nouns we have: I, twitter, h, m, mention, Gah, I, Jimmy, Choo, wtf, things. An extension to the tokeniser could help here, as could an addition to the lexicon to get wtf marked as UH (an interjection or exclamation).
Again we can see some tokenisation driven errors, but brain, hair, 10 and minutes pop out, which isn't too bad.
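Most of those errors come from naively splitting on whitespace. A slightly smarter tokeniser, sketched below (the regex is illustrative, not the post's actual code), keeps contractions like "I've" as single tokens and splits punctuation off words:

```python
import re

# Match: a word with an optional contraction suffix ("I've", "don't"),
# or a run of digits, or any single non-space character (punctuation).
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|\S")

def tokenise(text):
    """Split text into words, numbers and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(tokenise("I've got 10 minutes, wtf!"))
# ["I've", 'got', '10', 'minutes', ',', 'wtf', '!']
```

Feeding these cleaner tokens into the tagger would avoid mistagging strings like "I've" as nouns simply because they never appear in the lexicon.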
There are taggers that work quite differently, for example those based on Hidden Markov Models, which model tag probabilities extracted from the corpus, but given the small amount of code required for the results, I think the Brill tagger is a pretty nice option! There's much that could be done to tidy this one up, but particularly for long texts there is enough data to do some useful entity extraction and further processing.