The web is a great place for people to express their opinions, on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people's opinions from all this raw text can be a very powerful one, and it's a well studied area - no doubt because of the commercial possibilities.
Opinion mining, or sentiment analysis, is far from a solved problem though. People often express more than one opinion ("the movie was terrible, but DeNiro's performance was superb, as always"), use sarcasm ("this is probably the best laptop Dell could come up with"), or use negation and complex devices that can be hard to parse ("not that I'm saying this was a bad experience").
On top of this, expressions of sentiment tend to be very topic focused - what works for one subject might not work for another. To use a well worn example, it's a good thing to say that the plot of a movie is unpredictable, but a bad thing to say it about the steering of a car. Even within a certain product, the same words can describe opposite feelings about different features - it's bad for the start-up time on a digital camera to be long, but it's good for the battery life to be long. This is why a great deal of work, particularly in product reviews, is spent in classifying which element of a product is being talked about, before starting the opinion mining process.
We'll start with a simpler approach, and look at movie reviews. Luckily for us these are fairly easily available online from places like Rotten Tomatoes and IMDB, and indeed a convenient data set of sentences expressing positive and negative opinions has already been compiled. We're using opinions expressed on the sentence level in order to give ourselves a little more granularity - while most movie reviews are longer than this, they will also usually express more than one opinion, and keeping our document unit smaller helps us avoid muddying the waters.
The data is supplied as two files, one for positive opinions and the other negative, with one sentence per line, which makes it easy to parse. To actually extract the opinion, we're going to make use of a classic and well known tool, a Naive Bayesian classifier. These were all the rage for spam filters a couple of years back, and are still a hugely popular way of doing filtering. They have the advantage that they're easy to implement, pretty effective, and quick to classify with.
Bayesian classifiers are based around Bayes' rule, a way of looking at conditional probabilities that allows you to flip the condition around in a convenient way. A conditional probability is the probability that event X will occur, given the evidence Y. That is normally written P(X | Y). Bayes' rule allows us to determine this probability when all we have is the probability of the reverse condition, P(Y | X), and of the two components individually: P(X | Y) = P(X)P(Y | X) / P(Y). This restatement can be very helpful when we're trying to estimate the probability of something based on examples of it occurring.
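To make the rule concrete, here's a minimal sketch with toy numbers. The probabilities and the word "superb" are invented purely for illustration; nothing here comes from the real data set.

```php
<?php
// Toy numbers, invented purely for illustration
$pPos = 0.5;             // P(pos): chance a sentence is positive before we look at it
$pNeg = 0.5;             // P(neg)
$pWordGivenPos = 0.02;   // P("superb" | pos)
$pWordGivenNeg = 0.005;  // P("superb" | neg)

// P("superb"): the chance of seeing the word at all, summed over both classes
$pWord = $pPos * $pWordGivenPos + $pNeg * $pWordGivenNeg;

// Bayes' rule: P(pos | "superb") = P(pos) P("superb" | pos) / P("superb")
$pPosGivenWord = $pPos * $pWordGivenPos / $pWord;

echo "P(pos | superb) = " . round($pPosGivenWord, 3) . "\n";
```

With these made-up figures the word is four times as likely in positive sentences as negative ones, so seeing it moves us from an even chance to 0.8 in favour of positive.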
In this case, we're trying to estimate the probability that a document is positive or negative, given its contents. We can restate that so that it is in terms of the probability of that document occurring if it has been predetermined to be positive or negative. This is convenient, because we have examples of positive and negative opinions from our data set above.
The thing that makes this a "naive" Bayesian process is that we make a big assumption about how we can calculate the probability of the document occurring: that it is equal to the product of the probabilities of each word within it occurring. This implies that there is no link between one word and another word. This independence assumption is clearly not true: there are lots of words which occur together more frequently than either do individually, or with other words, but this convenient fiction massively simplifies things for us, and makes it straightforward to build a classifier.
We can estimate the probability of a word occurring given a positive or negative sentiment by looking through a series of examples of positive and negative sentiments and counting how often it occurs in each class. This is what makes this supervised learning - the requirement for pre-classified examples to train on.
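As a rough sketch of that counting step, the snippet below tallies tokens per class from a handful of labelled sentences held in memory. The countTokens helper and the training sentences are hypothetical, made up for the example; the real data comes from files, as described further on.

```php
<?php
// Hypothetical helper: count how often each token appears in each class.
// $examples is a list of array(sentence, class) pairs.
function countTokens(array $examples) {
    $counts = array();
    foreach($examples as $example) {
        list($sentence, $class) = $example;
        // Lowercase, then pull out runs of word characters as tokens
        preg_match_all('/\w+/', strtolower($sentence), $matches);
        foreach($matches[0] as $token) {
            if(!isset($counts[$token][$class])) {
                $counts[$token][$class] = 0;
            }
            $counts[$token][$class]++;
        }
    }
    return $counts;
}

// Made-up training sentences, just to show the shape of the result
$counts = countTokens(array(
    array('a superb film', 'pos'),
    array('superb acting, superb script', 'pos'),
    array('a dull film', 'neg'),
));
print_r($counts['superb']); // only ever seen in positive sentences
print_r($counts['film']);   // seen once in each class
```

A word like "film" that turns up equally in both classes tells us very little, while one that leans heavily to one side is what ends up doing the work.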
So, our initial formula looks like this.
P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)
We can drop the dividing P(sentence), as it's the same for both classes, and we just want to rank them rather than calculate a precise probability. We can use the independence assumption to let us treat P(sentence | sentiment) as the product of P(token | sentiment) across all the tokens in the sentence. So, we estimate P(token | sentiment) as
(count(this token in class) + 1) / (count(all tokens in class) + count(all tokens))
The extra 1 and the count of all tokens are known as 'add one' or Laplace smoothing, and stop a 0 finding its way into the multiplications. If we didn't have this, any sentence with an unseen token in it would score zero.
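To see the effect of the smoothing, here's a small sketch with made-up counts. The smoothed function is a hypothetical helper that follows the formula above as written; note that the more commonly seen form of add-one smoothing adds the number of distinct tokens to the denominator rather than the total token count, but either keeps the estimate above zero.

```php
<?php
// Made-up counts, for illustration only
$classTokCount = 1000;  // tokens seen in the positive class
$tokCount = 2000;       // tokens seen across both classes

// Smoothed estimate of P(token | class), following the formula above
function smoothed($count, $classTokCount, $tokCount) {
    return ($count + 1) / ($classTokCount + $tokCount);
}

$seen = smoothed(29, $classTokCount, $tokCount);   // token seen 29 times in the class
$unseen = smoothed(0, $classTokCount, $tokCount);  // token never seen in the class

echo "seen: $seen, unseen: $unseen\n";
// With smoothing the unseen token just drags the score down a bit...
echo "product with smoothing: " . ($seen * $unseen) . "\n";
// ...without it, a single unseen token wipes the whole product out
echo "product without: " . ((29 / $classTokCount) * (0 / $classTokCount)) . "\n";
```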
We're implementing this in PHP in the classify function:
<?php
class Opinion {
    // token => class => number of times the token was seen in that class
    private $index = array();
    private $classes = array('pos', 'neg');
    // Token totals, per class and overall
    private $classTokCounts = array('pos' => 0, 'neg' => 0);
    private $tokCount = 0;
    // Document (sentence) totals, per class and overall
    private $classDocCounts = array('pos' => 0, 'neg' => 0);
    private $docCount = 0;
    private $prior = array('pos' => 0.5, 'neg' => 0.5);

    // Train on a file of one document per line; $limit caps the lines read (0 = no cap)
    public function addToIndex($file, $class, $limit = 0) {
        if(!in_array($class, $this->classes)) {
            echo "Invalid class specified\n";
            return;
        }
        $fh = fopen($file, 'r');
        if($fh === false) {
            echo "Could not open $file\n";
            return;
        }
        $i = 0;
        while(($line = fgets($fh)) !== false) {
            if($limit > 0 && $i >= $limit) {
                break;
            }
            $i++;
            $this->docCount++;
            $this->classDocCounts[$class]++;
            $tokens = $this->tokenise($line);
            foreach($tokens as $token) {
                if(!isset($this->index[$token][$class])) {
                    $this->index[$token][$class] = 0;
                }
                $this->index[$token][$class]++;
                $this->classTokCounts[$class]++;
                $this->tokCount++;
            }
        }
        fclose($fh);
    }

    // Return the most likely class ('pos' or 'neg') for a document
    public function classify($document) {
        if($this->docCount == 0) {
            echo "No training data loaded\n";
            return null;
        }
        // Prior: the share of training documents in each class
        $this->prior['pos'] = $this->classDocCounts['pos'] / $this->docCount;
        $this->prior['neg'] = $this->classDocCounts['neg'] / $this->docCount;
        $tokens = $this->tokenise($document);
        $classScores = array();
        foreach($this->classes as $class) {
            $classScores[$class] = 1;
            foreach($tokens as $token) {
                $count = isset($this->index[$token][$class]) ?
                    $this->index[$token][$class] : 0;
                // Smoothed estimate of P(token | class)
                $classScores[$class] *= ($count + 1) /
                    ($this->classTokCounts[$class] + $this->tokCount);
            }
            $classScores[$class] = $this->prior[$class] * $classScores[$class];
        }
        // Highest score first
        arsort($classScores);
        return key($classScores);
    }

    // Lowercase the text and split it into word tokens
    private function tokenise($document) {
        $document = strtolower($document);
        preg_match_all('/\w+/', $document, $matches);
        return $matches[0];
    }
}
?>
The classify function starts by calculating the prior probability (the chance of it being one or the other before any tokens are looked at) based on the number of positive and negative examples - in this example that'll always be 0.5 as we have the same amount of data for each. We then tokenise the incoming document, and for each class multiply together the likelihood of each word being seen in that class. We sort the final result, and return the highest scoring class.
The other important method here is addToIndex. All this does is loop over the data, tokenising the documents and storing counts of the terms for later use.
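One thing to be aware of with multiplying lots of small probabilities together is that, on longer documents, the product can underflow to zero in floating point. A common way around this is to add up logarithms instead, which gives the same ranking. The sketch below is a hypothetical standalone version of the scoring loop, not part of the class above; its arguments mirror the class's $index, $classTokCounts, $tokCount and $prior properties, and the counts are toy values.

```php
<?php
// Hypothetical standalone scoring function that sums logs rather than
// multiplying probabilities. Returns the scores sorted highest first.
function logScores(array $tokens, array $index, array $classTokCounts, $tokCount, array $prior) {
    $scores = array();
    foreach($prior as $class => $p) {
        $scores[$class] = log($p);
        foreach($tokens as $token) {
            $count = isset($index[$token][$class]) ? $index[$token][$class] : 0;
            // log of the same smoothed estimate used in classify
            $scores[$class] += log(($count + 1) / ($classTokCounts[$class] + $tokCount));
        }
    }
    arsort($scores);
    return $scores;
}

// Toy counts, invented for the example
$index = array(
    'superb' => array('pos' => 3),
    'dull' => array('neg' => 2),
    'film' => array('pos' => 1, 'neg' => 1),
);
$scores = logScores(
    array('superb', 'film'),
    $index,
    array('pos' => 4, 'neg' => 3),
    7,
    array('pos' => 0.5, 'neg' => 0.5)
);
reset($scores);
echo key($scores) . "\n"; // the highest scoring class

// Why bother: 200 probabilities of 0.001 multiplied together underflow...
var_dump(pow(0.001, 200));
// ...while the equivalent sum of logs is still a perfectly usable number
var_dump(200 * log(0.001));
```

The scores are now negative numbers rather than probabilities, but since we only want to know which class ranks highest, that doesn't matter.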
We can generate a slightly scrubby test set by training on only the first 5,000 or so sentences of each file, and using the remaining examples to test with.
<?php
$op = new Opinion();
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.neg', 'neg', 5000);
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.pos', 'pos', 5000);
// $t counts correct classifications, $f incorrect ones
$i = 0; $t = 0; $f = 0;
$fh = fopen('opinion/rt-polaritydata/rt-polarity.neg', 'r');
while($line = fgets($fh)) {
    // Skip the lines used for training, and test on the held-back negative sentences
    if($i++ > 5001) {
        if($op->classify($line) == 'neg') {
            $t++;
        } else {
            $f++;
        }
    }
}
fclose($fh);
echo "Accuracy: " . ($t / ($t+$f));
?>
This gives an accuracy of around 0.8, which isn't bad really! To demonstrate it, we can chuck a couple of example sentences in:
<?php
$op = new Opinion();
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.neg', 'neg');
$op->addToIndex('opinion/rt-polaritydata/rt-polarity.pos', 'pos');
$string = "Avatar had a surprisingly decent plot, and genuinely incredible special effects";
echo "Classifying '$string' - " . $op->classify($string) . "\n";
$string = "Twilight was an atrocious movie, filled with stumbling, awful dialogue, and ridiculous story telling.";
echo "Classifying '$string' - " . $op->classify($string) . "\n";
?>
Which returns as expected:
Classifying 'Avatar had a surprisingly decent plot,
and genuinely incredible special effects' - pos
Classifying 'Twilight was an atrocious movie, filled with
stumbling, awful dialogue, and ridiculous story
telling.' - neg
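Going back to the evaluation for a moment: the earlier test only scored held-back negative sentences, so its figure is really the recall for the negative class. If we collect predictions for held-back sentences from both files, we can work out precision and recall for each class. The precisionRecall function below is a hypothetical helper and the labels are made up; in practice the predicted labels would come from $op->classify().

```php
<?php
// Hypothetical helper: precision and recall for one class, given parallel
// arrays of actual and predicted labels.
function precisionRecall(array $actual, array $predicted, $class) {
    $tp = 0; $fp = 0; $fn = 0;
    foreach($actual as $i => $label) {
        if($predicted[$i] == $class && $label == $class) {
            $tp++; // said $class, and it was
        } else if($predicted[$i] == $class) {
            $fp++; // said $class, but it wasn't
        } else if($label == $class) {
            $fn++; // was $class, but we missed it
        }
    }
    return array(
        'precision' => ($tp + $fp) > 0 ? $tp / ($tp + $fp) : 0,
        'recall' => ($tp + $fn) > 0 ? $tp / ($tp + $fn) : 0,
    );
}

// Made-up labels, purely to exercise the function
$actual    = array('pos', 'pos', 'pos', 'neg', 'neg');
$predicted = array('pos', 'pos', 'neg', 'pos', 'neg');
print_r(precisionRecall($actual, $predicted, 'pos'));
print_r(precisionRecall($actual, $predicted, 'neg'));
```

Precision tells us how often a "pos" answer can be trusted, and recall how many of the real positives we actually caught.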
We can even use it on a longer review, as long as we split into sentences first. I grabbed the review of Avatar from The Scientific Indian.
<?php
// … snip … article contents in $doc, $op set up as before
$sentences = explode(".", $doc);
$score = array('pos' => 0, 'neg' => 0);
foreach($sentences as $sentence) {
    // Skip the empty fragments that explode leaves behind
    if(strlen(trim($sentence))) {
        $class = $op->classify($sentence);
        echo "Classifying: \"" . trim($sentence) . "\" as " . $class . "\n";
        $score[$class]++;
    }
}
var_dump($score);
?>
Just to give a snippet of the output, we get:
Classifying: "Fortunately, the movie's moral premise plays
second fiddle to the technical feats" as neg
Classifying: "I enjoyed the movie" as pos
Classifying: "The ending is especially poignant" as pos
Classifying: "The visual effects are spectacular and a lot of
the production techniques are a first in the craft
of movie making" as pos
Classifying: "For that alone, the movie is a must see" as pos
array(2) {
["pos"]=>
int(25)
["neg"]=>
int(11)
}
So, broadly positive, which is the right direction!
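Splitting on every full stop is a bit blunt: it will break on decimals like "2.0", and ignores sentences ending in ! or ?. A slightly more careful sketch is below, splitting only where sentence punctuation is followed by whitespace. The splitSentences helper and the sample text are hypothetical, and it will still trip over abbreviations like "Mr. Smith".

```php
<?php
// Hypothetical helper: split after ., ! or ? when followed by whitespace,
// rather than on every full stop.
function splitSentences($doc) {
    $parts = preg_split('/(?<=[.!?])\s+/', trim($doc), -1, PREG_SPLIT_NO_EMPTY);
    return array_map('trim', $parts);
}

// Made-up sample text
$doc = 'I enjoyed the movie. Version 2.0 of the effects is better! Was it worth it? Yes.';
$sentences = splitSentences($doc);
print_r($sentences);
```

The resulting array can be fed to the same loop as before in place of the explode call.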
There's a lot we haven't addressed in our classifier. We could pass the sentences through a couple of other classifiers first, using Bayesian techniques again, in order to determine some more useful facts. For example, is this even a review? If we just start processing blog posts, for example, we'll find a lot that mention a movie without actually saying whether it's good or bad, and we may as well discard those.
Then, for each sentence, which part of the movie is it talking about? We might be able to correctly interpret a review which slams the actor and slates the script, but is impressed with the special effects. At each stage, the process would be the same as this time - find or create training data, train a classifier, and let it go to work.
We could also look at more complicated language models and named entity extractors, that allow us to map the odd phrases that sometimes occur, and associate opinions with the appropriate parts of a sentence. This can be a lot more work, but can also lead to higher accuracy and reliability.
Photo Credit: Grégory Tonon
Ivo Nascimento
January 21st, 2010 at 13:17
Great article.
I like this subject too and think your code is a beautiful implementation.
thanks for sharing!
Jordan Walker
January 21st, 2010 at 14:22
How do you take into account a change of opinion, or in the case of purchasing: buyer remorse?
Improved Bayesian Opinion Mining « NLP e dintorni
March 8th, 2010 at 22:56
Improved Bayesian Opinion Mining: I was very interested in this article by Ian Barber, which shows how to apply Bayesian analysis to a collection of “o...
darko
March 8th, 2010 at 23:13
Hi Ian,
thanks for sharing this. I've played a little with your code, I wanted to see if excluding closed grammar categories from being counted could improve the accuracy... well, not much but if you're curious here is the result:
0.826747720365 Vs 0.829787234043
I've reported it on my brand new blog http://darkoromanov.wordpress.com/2010/03/08/improved-bayesian-opinion-mining/
darko
Ian Barber
March 9th, 2010 at 07:55
Interesting result darko, thanks! I think your guess about the 'noise' of the unhelpful words being consistent through the documents was about right, but it's still a nice bit of preprocessing to do.
Chiranjibi Sitaula
April 2nd, 2010 at 11:21
nice code and it is helpful for me..
CDCSIT, TU, Nepal
Should Opinion Be a Ranking Factor? • Tim Nash “stuff” Blog
May 15th, 2010 at 14:32
...nces with negative or positive opinions. The code for our initial version was heavily influenced by Baysian opinion mining code on PHPIR. However, we did end up rewiting the code in C++ as a php module to speed up the resu...
Chiranjibi Sitaula
May 28th, 2010 at 10:08
Nice codes..can't we calculate precision and recall in this code...
Regards
Chiranjibi Sitaula
CDCSIT,TU,Nepal
Shahzeb
July 3rd, 2010 at 03:00
I am getting these 3 errors, please help me!
Warning: fopen(opinion/rt-polaritydata/rt-polarity.neg) [function.fopen]: failed to open stream: No such file or directory in C:\EasyPHP 2.0b1\www\bayesianopinionmining\bom.php on line 13
Warning: Division by zero in C:\EasyPHP 2.0b1\www\bayesianopinionmining\bom.php on line 86
Notice: Undefined variable: doc in C:\EasyPHP 2.0b1\www\bayesianopinionmining\bom.php on line 103
Shahzeb
July 3rd, 2010 at 03:06
Thanks above errors are resolved but still only one error left now
Notice: Undefined variable: doc in C:\EasyPHP 2.0b1\www\bayesianopinionmining\bom.php on line 103
please help me!
Ian Barber
July 5th, 2010 at 09:16
If that's from the longer review snippet, you probably just need to set $doc equal to the string you want to split into sentences.
achofanto
July 5th, 2010 at 11:05
Thanks for the tutorial. I'd appreciate it if you could also do association rule mining in PHP.
Thanks again!
Aku
October 24th, 2010 at 04:55
I wish you could rewrite your code in java.
Rio Akasaka
January 15th, 2011 at 02:34
This is jaw-droppingly awesome.
Sameer
March 2nd, 2011 at 10:02
Where do I get the file: 'opinion/rt-polaritydata/rt-polarity.neg' and .pos from?
Ian Barber
March 2nd, 2011 at 11:09
It's linked in the article Sameer, http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz
John
April 12th, 2011 at 20:51
What I can't understand is how you decide if a word is negative or positive? Is there a word bank involved? I didn't see any mention of one.
Ian Barber
April 13th, 2011 at 11:15
Hi John, that's the point of the algorithm, it learns whether things are good or bad based on the messages you train on - whole comments are marked as positive or negative, and the algorithm determines the positiveness or negativeness of words.
Anonymous
June 27th, 2011 at 14:35
…ine is not a complete answer. But please check also the PHP example given by Ian Barber at the page http://phpir.com/bayesian-opinio... - there you will find a good example of a simple implementation. Good work :)This answer .Please sp…
Self-Improving Bayesian Sentiment Analysis for Twitter « Wiki
August 28th, 2011 at 13:35
…more likely to result in a ‘positive’ tweet, and ‘fail’ as a negative tweet.
I started with Ian Barber’s excellent PHP class for simple Bayesian classification, but wanted to improve the basic quality.
The simplest way to do…
Matt Kaufmann
September 3rd, 2011 at 12:15
Very nice; thanks for using PHP.
David
October 25th, 2011 at 21:20
Thanks for making this available. In PHP and all :) I tried your vector space model implementation and it worked beautifully. I'm sure this one will too once I get into it.
James
December 19th, 2011 at 00:58
Thanks Ian, this has been a hugely helpful post!
After following the blog post I created my own bayesian sentiment classifier in php heavily based on this article.
https://github.com/JWHennessey/phpInsight
Stefan Vasco
January 18th, 2012 at 13:19
Use a stemmer, the accuracy will be around 90%, this is how I implemented it.
Great job! Thank you!
How to optimise an sentiment analysis algorithm for larger data sets? | DIGG LINK
May 25th, 2012 at 14:35
…s? Asked by: 1 views , , , , , , I am a noob to sentiment analysis and found a good resource for Bayesian Opinion Mining and a way to . I was wondering though, if the optimum analysis is dependent upon the supplied data …
naveen kuamr
September 9th, 2012 at 14:56
great work!..
I am trying to run it on thousands of string and it gives expected answer but takes lot of time to finish...
any idea why is it so slow and how can i reduce execution time.
bbx402
November 10th, 2012 at 21:49
great work !!
I'm having trouble with "Division by zero" in windows.
Can't figure out why.
Any help?
Ian Barber
November 16th, 2012 at 22:41
bbx402: Depends where it's happening - should be fairly clear what to exclude. What is the line where the error is being triggered?