Text Generation

· December 9, 2009

language

After a rather technical post last week, here is something a bit lighter. Text and language generation is a fun topic, with applications that run from randomly generating scientific papers for conferences to the practical tasks of generating speech and automated responses. In this post we’ll look at how to generate some nonsense text based on existing documents. It isn’t overly practical, though it can make a fun change from Lorem Ipsum for holding copy. The code is included throughout, but you can also grab the lot in a zip.

The basic idea is to extract a set of probabilities of certain words appearing, and then generate a document based on this probability model. We’ll be using pairs of words as our statistical unit of choice, making this a two-word (bigram) language model. This is really a Markov chain: for any given word in the text, we model the probabilities of moving to each other word. You can imagine a Markov chain as a big table of states. By reading down each column you can see, for the state represented by that column, what the probability is of moving to any of the other states. This means that if we generate text by walking through this table then, much like the real world, it’s hard to predict the exact outcome but easy to predict the statistical properties.
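As a rough illustration of that table, the sketch below builds one from a made-up ten-word corpus. The corpus and the variable names are assumptions for the example only, and the table is indexed as [current word][next word] rather than by column.

```php
<?php
// Toy corpus (made up for illustration), already split into words.
$words = explode(' ', 'the cat sat on the mat and the cat slept');

// Count how often each word is followed by each other word.
$counts = array();
for ($i = 0; $i < count($words) - 1; $i++) {
    $from = $words[$i];
    $to = $words[$i + 1];
    if (!isset($counts[$from][$to])) {
        $counts[$from][$to] = 0;
    }
    $counts[$from][$to]++;
}

// Turn each word's counts into transition probabilities.
$table = array();
foreach ($counts as $from => $followers) {
    $total = array_sum($followers);
    foreach ($followers as $to => $count) {
        $table[$from][$to] = $count / $total;
    }
}

// "the" is followed by "cat" twice and "mat" once.
printf("the -> cat: %.2f\n", $table['the']['cat']); // 0.67
printf("the -> mat: %.2f\n", $table['the']['mat']); // 0.33
```

Each row of the table sums to 1, which is what lets us treat it as a distribution to sample from.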

The way we do this is really pretty simple: tokenise the text, then keep a count for each pair of tokens. For each word, we normalise the counts of the words that follow it to get the probability of moving from that word to each of its followers. To generate the text itself we just choose a random first word, get the list of words paired with it, and choose a random second word, which then becomes the current word. We keep outputting words until we hit our word limit. In both cases “random” means picking a random number between 0 and 1 and looking it up in the distribution to find the word, so more common words are more likely to come up.
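The lookup step is easier to see in isolation. This is a minimal sketch, assuming the counts are stored as running totals. The function names cumulative and lookup and the counts themselves are invented for the example, and the random point is passed in so the result can be checked by hand.

```php
<?php
// Convert raw counts into running (cumulative) probabilities.
function cumulative(array $counts) {
    $total = array_sum($counts);
    $running = 0;
    foreach ($counts as $word => $count) {
        $running += $count / $total;
        $counts[$word] = $running;
    }
    return $counts;
}

// Return the first word whose running total exceeds the given point.
function lookup(array $distribution, $point) {
    $last = null;
    foreach ($distribution as $word => $upper) {
        if ($point < $upper) {
            return $word;
        }
        $last = $word;
    }
    return $last; // guards against rounding at the top of the range
}

// Made-up counts of the words seen after "the".
$afterThe = cumulative(array('cat' => 6, 'mat' => 3, 'end' => 1));
// cat covers 0 to 0.6, mat 0.6 to 0.9, end 0.9 to 1

echo lookup($afterThe, 0.25), "\n"; // cat
echo lookup($afterThe, 0.75), "\n"; // mat
echo lookup($afterThe, mt_rand() / (mt_getrandmax() + 1)), "\n"; // random draw
```

Because "cat" owns the widest slice of the 0 to 1 range, a uniformly random point lands on it most often.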

PHP Text Generator

The class below does exactly this. When given a file to base the generated text on, as an argument to the learn function, it first tokenises the file then stores pairs in an array of [first word][second word]. Each pair is associated with a count, which is then normalised into a running total between 0 and 1 for each set - so for any given word we have the probability of moving to each of its possible pairings, with the probability of going to any word not seen paired with it implicitly being zero. This is accompanied by an array, rootScores, covering all the first words, so that at the start of the algorithm, or when we hit a sentence ender like ‘.’, we can choose a new word to start generating from.

The generate function performs the random walk through the probabilities. At each step it tries to pick a word from the pairs for the current word, and it tries to ensure that that word isn’t the same as the one it started with, to avoid very small loops. If the current word doesn’t have any pairs, or isn’t set (in the case of the first word or sentence enders), then it picks a starting word from rootScores. The actual picking is done by the pick function, which just generates a random float between 0 and 1 and returns the corresponding item from the passed array.

<?php
class LangGen {
        protected $model = array();
        protected $rootScores = array();
        protected $sentenceEnd = array('.', '!', '?');
        protected $joinSentence = array(',', ':', ';');

        public function learn($filePath) {
                $contents = strip_tags(file_get_contents($filePath));
                $tokens = $this->tokenise($contents);
                unset($contents);
                
                $prevToken = null;
                foreach($tokens as $token) {
                        if($prevToken) {
                                if(!isset($this->model[$prevToken])) {
                                        $this->model[$prevToken] = array();
                                }
                                if(!isset($this->model[$prevToken][$token])) {
                                        $this->model[$prevToken][$token] = 0;
                                }
                                $this->model[$prevToken][$token]++;
                        }
                        $prevToken = $token;
                        
                        // handle sentence enders
                        if(in_array($token, $this->sentenceEnd)) {
                                $prevToken = null;
                        } else {
                                if(!isset($this->rootScores[$token])) {
                                        $this->rootScores[$token] = 0;
                                }
                                $this->rootScores[$token]++;
                        }
                }
                unset($tokens);
                
                // normalise probabilities
                foreach($this->model as $key => $tokens) {
                        $this->model[$key] = $this->probNormalise($tokens);
                }
                $this->rootScores = $this->probNormalise($this->rootScores);
        }
        
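        public function generate($length = 100) {
                $return = array();
                $word = null;
                for($i = 0; $i < $length; $i++) {
                        if($word !== null && isset($this->model[$word])) {
                                $next = $this->pick($this->model[$word]);
                                // retry a few times to avoid very small loops, but give
                                // up if the word can only be followed by itself
                                for($try = 0; $try < 10 && $next === $word; $try++) {
                                        $next = $this->pick($this->model[$word]);
                                }
                                $return[$i] = $word = $next;
                        } else {
                                // first word, sentence ender, or no known pairs
                                $return[$i] = $word = $this->pick($this->rootScores);
                        }
                }
                return $this->generateString($return);
        }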
        
        protected function generateString(array $words) {
                $out = array();
                $capitalise = true; // the next word starts a sentence
                foreach($words as $word) {
                        $isEnd = in_array($word, $this->sentenceEnd);
                        if($isEnd || in_array($word, $this->joinSentence)) {
                                // attach punctuation to the previous word, if there is one
                                if(count($out) > 0) {
                                        $out[count($out) - 1] .= $word;
                                }
                                $capitalise = $capitalise || $isEnd;
                        } else {
                                $out[] = $capitalise ? ucfirst($word) : $word;
                                $capitalise = false;
                        }
                }
                return implode(' ', $out);
        }
        
        protected function probNormalise($array) {
                $total = array_sum($array);
                $runningScore = 0;
                foreach($array as $key => $score) {
                        $runningScore += ($score/$total);
                        $array[$key] = $runningScore; 
                }
                return $array;
        }

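        protected function pick($array) {
                // random float in [0, 1)
                $floatRand = mt_rand(0, mt_getrandmax() - 1) / mt_getrandmax();
                $last = null;
                foreach($array as $key => $value) {
                        if($floatRand < $value) {
                                return $key;
                        }
                        $last = $key;
                }
                // rounding can leave the final running score just under 1
                return $last;
        }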

        protected function tokenise($string) {
                // words (including apostrophes) or single punctuation marks
                preg_match_all("/['\w]+|[:;.?!,]/", $string, $matches);
                $tokens = array();
                foreach($matches[0] as $match) {
                        // drop bare numbers, lowercase everything else
                        if(!is_numeric($match)) {
                                $tokens[] = strtolower($match);
                        }
                }
                return $tokens;
        }
}
?>

The tokeniser is a bit different from previous ones, as we want to specifically separate out punctuation. We don’t really have to do this, as leaving punctuation attached to the words actually gives a very nice bit of text generation, but it’s interesting to view the relationship between the different types of punctuation and the words of the text.
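To see what separating out punctuation means in practice, here is a standalone sketch. The function name splitTokens is hypothetical and the pattern is a simplified stand-in for the one in the class, so treat it as an illustration rather than the exact tokeniser.

```php
<?php
// Hypothetical standalone tokeniser: runs of word characters and apostrophes
// become word tokens, and each listed punctuation mark is its own token.
function splitTokens($text) {
    preg_match_all("/['\\w]+|[:;.?!,]/", $text, $matches);
    return array_map('strtolower', $matches[0]);
}

print_r(splitTokens("It's late, Lizzy. Isn't it?"));
// tokens: it's | late | , | lizzy | . | isn't | it | ?
```

The comma and full stop become tokens in their own right, so the model learns which words tend to come before and after them.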

For a change we’ve used a much longer example text, Jane Austen’s Pride & Prejudice from Project Gutenberg.

<?php
$langGen = new LangGen();
$langGen->learn('1342.txt');
echo $langGen->generate();
?>

"Hurst, when at her bracelets and when, was there an opportunity. Of admiration, lizzy, to see a charming, but she could be as politely by the reason for exertion of their own happiness overflows in his wishing them, gave them. Has been presented? I may depend on the wedding need not immediately on you have employment, in time before you out to stay, we are much of the preference of her desire of the hearth, which will keep winking at hunsford between them to make her brother."

However, you can get entertaining results from much smaller blocks of text, such as web pages. For example, here is the result from a page of the excellent Eloquent JavaScript site.

<?php
$langGen = new LangGen();
$langGen->learn('http://eloquentjavascript.net/chapter6.html');
echo $langGen->generate();
?>

Stroustrup is often useful. So on three values, map takes two hours in the biggest kind of discarding the introduction to successful html documents, you have read like this to say '5 10' in the syntax here is chosen. That strings creates a function. The computer programming, b; else is the code. Href: footnote number footnote var footnotes are dry kind of a big strings. Function. Look embarrassingly amateurish. P: var paragraphs are getting entirely. The expressions inside the secret to baffle us enough.

Grammar

One thing we aren’t considering is any form of grammar. The simplest way to approach this would be to attach the part of speech to each word and run the same process. We’d want a larger body of text, but it might lend something to the result. We’ll be using the PoS tagger from an earlier post, but because there are a couple of minor modifications it’s included in this zip of the code.

<?php
class PosLangGen extends LangGen { 
        private $tagger; 
        
        public function __construct($lexicon = 'lexicon.txt') {
                $this->tagger = new PosTagger($lexicon);
        }
        
        protected function tokenise($contents) {
                $tokens = parent::tokenise($contents);
                $return = array();
                $tags = $this->tagger->tag($tokens);
                foreach($tokens as $i => $token) {
                        $return[] = $token . "/" . $tags[$i];
                }
                unset($tokens);
                unset($tags);
                return $return;
        }
        
        protected function generateString(array $words) {
                foreach($words as $key => $word) {
                        list($word, $tag) = explode("/", $word);
                        $words[$key] = $word;
                }
                return parent::generateString($words);
        }
}
?>

"Was. Darcy? Not my share of books were not long wished to inform us. On mrs. Lady catherine, by no means satisfy her friends than lovely and see them with mutual civility that these are conditions which often told me laugh at such a subject, indeed had passed some delicacy restrained her younger sisters, you for a colder voice whether she felt that is impossible for exposing herself. She then accompanied her. Tell you must not to another man from something, said elizabeth was out very soon began to make"

Send Mr Change for observations

The other approach we could take would be to try to extract a grammar model from the text, then generate a sequence of tags based on that model. We could then look up which words should fill each tag based on their own probabilities. This is actually another pretty easy tweak on top of the code we already have.
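Before the full class, here is a toy version of the two-stage idea using hand-written tables. The tag names, the words, the probabilities and the pickFrom helper are all assumptions made for the example. In the real class both tables are learned from the text.

```php
<?php
// Hand-written toy tables; the tag names (DT, NN, VB) are illustrative only.
// Stage one: which tag follows which, as cumulative probabilities.
$tagModel = array(
    'DT' => array('NN' => 1.0),
    'NN' => array('VB' => 1.0),
    'VB' => array('DT' => 1.0),
);
// Stage two: which words can fill each tag, also cumulative.
$wordsForTag = array(
    'DT' => array('the' => 0.7, 'a' => 1.0),
    'NN' => array('letter' => 0.5, 'sister' => 1.0),
    'VB' => array('read' => 0.4, 'wrote' => 1.0),
);

// Pick the first key whose cumulative score exceeds a random point in [0, 1).
function pickFrom(array $distribution) {
    $point = mt_rand() / (mt_getrandmax() + 1);
    $last = null;
    foreach ($distribution as $key => $upper) {
        if ($point < $upper) {
            return $key;
        }
        $last = $key;
    }
    return $last;
}

// Walk the tag model first, then fill each tag with a word.
$tag = 'DT';
$tags = array();
for ($i = 0; $i < 6; $i++) {
    $tags[] = $tag;
    $tag = pickFrom($tagModel[$tag]);
}
$sentence = array_map(function ($t) use ($wordsForTag) {
    return pickFrom($wordsForTag[$t]);
}, $tags);

echo implode(' ', $tags), "\n";     // DT NN VB DT NN VB
echo implode(' ', $sentence), "\n"; // varies, e.g. "the sister wrote a letter read"
```

The tag sequence here is fixed because every transition has probability 1. With a learned model the grammar skeleton would vary as well as the words.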

<?php
class PosGramGen extends LangGen { 
        private $tagger; 
        private $types = array();
        
        public function __construct($lexicon = 'lexicon.txt') {
                $this->tagger = new PosTagger($lexicon);
        }
        
        protected function tokenise($contents) {
                $tokens = parent::tokenise($contents);
                $return = array();
                unset($contents);
                $tags = $this->tagger->tag($tokens);
                foreach($tags as $i => $tag) {
                        if(!isset($this->types[$tag])) { 
                                $this->types[$tag] = array();
                        }
                        if(!isset($this->types[$tag][$tokens[$i]])) {
                                $this->types[$tag][$tokens[$i]] = 0;
                        }
                        
                        $this->types[$tag][$tokens[$i]]++;
                        $return[] = $tag;
                }
                unset($tokens);
                unset($tags);
                foreach($this->types as $key => $types) {
                        $this->types[$key] = $this->probNormalise($types);
                }
                return $return;
        }
        
        protected function generateString(array $words) {
                foreach($words as $key => $tag) {
                        $words[$key] = $this->pick($this->types[$tag]);
                }
                return parent::generateString($words);
        }
}
?>

"Compatible palings air you is moreover, laughed passed to pass. Of wickham, what rest jane; whether feelings; and you was, had chiefly though love tenants latter mr, to send mr change for observations of the marriage of your darcy's pressed to the one with i if her half been contrary husband have satisfied types, who it may so to it but wickham employment of sentiment, mr side without either one obliged it pretty feel and their much spirits education me is at rather of sister jane. She; and her"

I couldn’t say it’s any better than the others. With some work on the word choice logic you may be able to get it somewhere in the region of making sense, but whether that’s actually worth doing is entirely down to your own sensibilities.