Tokenisation

· September 5, 2009

special interest

Taking a string and separating it into tokens is one of those smaller problems in search that seems initially simple - split on spaces - but can quickly become overwhelmed with edge cases. Ignoring the problem of other languages, some of which don’t even necessarily use a space, the exceptions tend to fall into two categories, punctuation related and normalisation.

The punctuation problems stem from these characters are processed. Take a look at this sentence:

“That’s not one of the department’s cars…” said O’Leary, “it’s the F.B.I.”

Splitting on spaces would give us the following tokens:

“That’s
not
one
of
the
department’s
cars…”
said
O’Leary,
“it’s
the
F.B.I.”

Obviously we don’t want people to have to search for “O’Leary,”, but splitting on the punctuation as well would give us “F”, “B” and “I” as separate tokens, which doesn’t seem ideal. Even filtering out the punctuation afterwards, which looks like a better bet, means “department’s” becomes “departments” so a search for “department” wouldn’t match. It’s never too hard to think of some exceptions to whichever set of rules we use, which makes picking the best general solution rather subjective.

Some particularly tricky areas include hyphens - one added mid-word should probably be stripped and the word collapsed into one token, but a phrase that has been glued together using it, such as “knock-down-drag-out”, probably shouldn’t. A forum post by a user called -=[ShOgUn]=- is even trickier.

While there isn’t an ideal solution, the problem is mitigated a somewhat if the tokenising for the documents and the query are the same. If not, you could have the document tokenise O’Leary to OLeary, but the query leave it as O’Leary. As long as they operate in the same manner, the system should perform reasonably.

Of course, the question of case folding and other normalisation can still be an issue. If you do get a word with a hyphen, do you remove the hyphen and index that, or index the document against both the hyphenated and unhyphenated version? Case folding can be similarly tricky - converting everything to lower case is straightforward, but you can lose some helpful distinctions. A capital letter at the start of a sentence can be safely ignore, but one in the middle might indicate a proper noun:

Friends are forever I watched Friends last night This can include all caps, which might indicate an acronym. It’s pretty easy to see the ‘The Who’ is a the band ‘the WHO’ is the world health organisation and ‘who?’ is probably a word we could ignore without much loss, but once tokenised and lowercased, they all look the same. That said, lowercasing is not a terrible strategy, but it sometimes can be a missed opportunity for extracting useful data from the text.

Zend Search Lucene

Lucene uses a flexible system to allow plugging in different ‘Analyzers’ as required. The implementation in Zend Framework is of interest due to being in raw PHP, and the default tokeniser can be found in Zend_Search_Lucene_Analysis_Analyzer_Common_Text_CaseInsensitive. This takes the simple route and just looks for strings of letter characters:

preg_match('/[a-zA-Z]+/', $this->_input, $match, PREG_OFFSET_CAPTURE, $this->_position));

This is then transformed to lowercase using Zend_Search_Lucene_Analysis_TokenFilter_LowerCase. All in all, this system, despite being simple, is generally pretty robust - it’ll generate a reasonable set of tokens for a wide range of documents.

Xapian

Xapian on the other hand takes a more complicated route straight out of the box. As well as treating runs of letter characters as tokens, it allows them to contain “_” (which is handy for searching references to underscore_separated_variables), the apostrophe, and the ampersand (for AT&T and similar). Xapian also allows suffixes of up to 3 characters on words, to catch tokens like “C++”, and it tries to catch acronyms split with periods, such as ‘F.B.I.’ in our example.

The parser is based around iterating over a stream of unicode characters. Tokens are generally converted to lower case, the same as Lucene, but stemming is also performed. Stemming is an interesting technique, worth a post of it’s own, that tries to generate a standard ‘stem’ word from a number of derivatives, e.g. “hiding”, “hides” and and “hidden” might all map to “hid”. You can read more about term generator on the xapian trac.

A PHP equivalent of the Xapian tokeniser would probably need to work on a stream basis, but we can pick up some of the punctuation rules easily in a regular expression. There is UTF8 version of the ZSL token analyser which picks up sequences of UTF8 ‘letter’ code points*, but we can go a bit further towards the definition of a word used in Xapian. We’ll make use of the matchers for the letter codepoint \p{L} (\p for ‘point’, and L for Letter), \p{Mn} non spacing mark (such as an accent that applies to the previous character), \p{N} number, and \p{Pc} connecting punctuation (primarily for the underscore). We also allow a suffix of up to three + or # characters to pick up a term like C#.

preg_match('/[\p{L}\p{N}][\p{L}\p{Mn}\p{N}\p{Pc}&\']*[\+\#]{0,3}/u', $input, $matches);

There are hundreds more ways of tokenising, and we haven’t looked at issues like looking for phrase searches, or performance (which can be absolutely key in big systems - just using unicode slows down regular expressions in PHP for example). The tokeniser can easily be a source of poor results on certain queries though, and is worth a bit of thought whenever text processing is required.

These aren’t quite characters - for example an accented character might be specified as one code point for the letter and another for the accent. ** Note, that this would likely require using iconv or similar to normalise text to UTF-8