The web is a great place for people to express their opinions on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people's opinions from all this raw text is a powerful one, and it's a well-studied area - no doubt because of the commercial possibilities.
Opinion mining, or sentiment analysis, is far from a solved problem though. People often express more than one opinion ("the movie was terrible, but DeNiro's performance was superb, as always"), use sarcasm ("this is probably the best laptop Dell could come up with"), or use negation and other complex devices that can be hard to parse ("not that I'm saying this was a bad experience").
On top of this, expressions of sentiment tend to be very topic focused - what works for one subject might not work for another. To use a well-worn example, it's a good thing to say that the plot of a movie is unpredictable, but a bad thing to say it about the steering of a car. Even within a single product, the same words can describe opposite feelings about different features - it's bad for the start-up time on a digital camera to be long, but good for the battery life to be long. This is why a great deal of work, particularly on product reviews, goes into classifying which element of a product is being talked about before the opinion mining process even starts.
At the movies
We'll start with a simpler approach, and look at movie reviews. Luckily for us these are fairly easily available online from places like Rotten Tomatoes and IMDB, and indeed a convenient data set of sentences expressing positive and negative opinions has already been compiled. We're using opinions expressed at the sentence level in order to give ourselves a little more granularity - while most movie reviews are longer than this, they will also usually express more than one opinion, and keeping our document unit smaller helps us avoid muddying the waters.
The data is supplied as two files, one for positive opinions and the other for negative, with one sentence per line, which makes it easy to parse. To actually extract the opinion, we're going to make use of a classic and well-known tool, a Naive Bayesian classifier. These were all the rage for spam filters a couple of years back, and are still a hugely popular way of doing filtering. They have the advantage that they're easy to implement, pretty effective, and quick to classify with.
Bayesian classifiers are based around the Bayes rule, a way of looking at conditional probabilities that allows you to flip the condition around in a convenient way. A conditional probability is the probability that event X will occur given the evidence Y, normally written P(X | Y). The Bayes rule allows us to determine this probability when all we have is the reverse conditional probability, and the probabilities of the two events individually: P(X | Y) = P(X)P(Y | X) / P(Y). This restatement can be very helpful when we're trying to estimate the probability of something based on examples of it occurring.
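To make the flip concrete with some made-up numbers (purely illustrative, not taken from any real data set): suppose half of all reviews are positive, the word "superb" appears in 10% of positive reviews, and it appears in 6% of reviews overall.

```php
<?php
// Illustrative figures only: these are assumed numbers, not measured ones.
$pPositive = 0.5;            // P(X): prior probability of a positive review
$pSuperbGivenPositive = 0.1; // P(Y | X): chance of seeing 'superb' in a positive review
$pSuperb = 0.06;             // P(Y): chance of seeing 'superb' in any review

// Bayes rule: P(X | Y) = P(X) P(Y | X) / P(Y)
$pPositiveGivenSuperb = $pPositive * $pSuperbGivenPositive / $pSuperb;
echo $pPositiveGivenSuperb; // roughly 0.83
```

So a review containing "superb" has roughly an 83% chance of being positive - we've flipped from "how often does this word appear in positive reviews" to "how likely is a review containing this word to be positive".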
In this case, we're trying to estimate the probability that a document is positive or negative, given its contents. We can restate that so that it is in terms of the probability of that document occurring given that it has been predetermined to be positive or negative. This is convenient, because we have examples of positive and negative opinions from our data set above.
The thing that makes this a "naive" Bayesian process is that we make a big assumption about how we can calculate the probability of the document occurring: that it is equal to the product of the probabilities of each word within it occurring. This implies that there is no link between one word and another. This independence assumption is clearly not true - there are lots of words that occur together far more frequently than they would independently - but this convenient fiction massively simplifies things for us, and makes it straightforward to build a classifier.
We can estimate the probability of a word occurring given a positive or negative sentiment by looking through a series of examples of positive and negative sentiments and counting how often it occurs in each class. This is what makes this supervised learning - the requirement for pre-classified examples to train on.
So, our initial formula looks like this:
P(sentiment | sentence) = P(sentiment)P(sentence | sentiment) / P(sentence)
We can drop the dividing P(sentence), as it's the same for both classes, and we just want to rank them rather than calculate a precise probability. We can use the independence assumption to let us treat P(sentence | sentiment) as the product of P(token | sentiment) across all the tokens in the sentence. So, we estimate P(token | sentiment) as
(count(this token in class) + 1) / (count(all tokens in class) + count(all distinct tokens))
The extra 1 in the numerator, and the count of distinct tokens in the denominator, are called 'add one' or Laplace smoothing, and stop a 0 finding its way into the multiplications. If we didn't have it, any sentence with an unseen token in it would score zero. We can implement the above in PHP, with the work happening in a classify function:
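The original listing isn't reproduced here, so this is a minimal sketch of such a class. The method names addToIndex and classify come from the text; the tokenisation and storage details are assumptions.

```php
<?php
// A minimal Naive Bayes sentiment classifier, following the approach
// described in the article. Details beyond the addToIndex/classify
// method names are assumptions.
class SentimentClassifier
{
    private $tokenCounts = [];  // class => token => count
    private $totalTokens = [];  // class => total tokens seen in that class
    private $docCounts   = [];  // class => number of training sentences
    private $vocabulary  = [];  // every distinct token seen in training

    private function tokenise(string $document): array
    {
        // lowercase, then split on anything that isn't a letter or digit
        return preg_split('/[^a-z0-9]+/', strtolower($document), -1, PREG_SPLIT_NO_EMPTY);
    }

    // Loop over the training data, storing term counts for later use
    public function addToIndex(array $documents, string $class): void
    {
        foreach ($documents as $document) {
            $this->docCounts[$class] = ($this->docCounts[$class] ?? 0) + 1;
            foreach ($this->tokenise($document) as $token) {
                $this->tokenCounts[$class][$token] =
                    ($this->tokenCounts[$class][$token] ?? 0) + 1;
                $this->totalTokens[$class] = ($this->totalTokens[$class] ?? 0) + 1;
                $this->vocabulary[$token] = true;
            }
        }
    }

    // Score the document against each class and return the best one
    public function classify(string $document): string
    {
        $tokens    = $this->tokenise($document);
        $totalDocs = array_sum($this->docCounts);
        $vocabSize = count($this->vocabulary);
        $scores    = [];
        foreach ($this->docCounts as $class => $docCount) {
            $score = $docCount / $totalDocs; // the prior, P(sentiment)
            foreach ($tokens as $token) {
                $count = $this->tokenCounts[$class][$token] ?? 0;
                // Laplace-smoothed P(token | sentiment); for long documents
                // you'd sum logs instead, to avoid floating point underflow
                $score *= ($count + 1) / ($this->totalTokens[$class] + $vocabSize);
            }
            $scores[$class] = $score;
        }
        arsort($scores);       // highest score first, keys preserved
        return key($scores);   // the winning class name
    }
}
```

Training is then just a matter of calling addToIndex once per class with the sentences from each file, and classify with any new sentence.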
The classify function starts by calculating the prior probability (the chance of it being one or the other before any tokens are looked at) based on the number of positive and negative examples - in this example that'll always be 0.5 as we have the same amount of data for each. We then tokenise the incoming document, and for each class multiply together the likelihood of each word being seen in that class. We sort the final result, and return the highest scoring class.
The other important method here is addToIndex. All this does is loop over the data, tokenising the documents and storing counts of the terms for later use.
We can generate a slightly scrubby test set by holding back a little of the data from training, and using those held-out examples to test with.
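The split might look like this - the 90/10 proportion here is an assumption, and the sample data is a toy stand-in for the real one-sentence-per-line files:

```php
<?php
// Hold back a fraction of each class for testing; the 90/10 split is
// an assumed figure, not one taken from the article.
function splitData(array $sentences, float $trainFraction = 0.9): array
{
    $cut = (int) floor(count($sentences) * $trainFraction);
    return [
        array_slice($sentences, 0, $cut),  // training examples
        array_slice($sentences, $cut),     // held-out test examples
    ];
}

// Toy stand-in data for one class
$positive = ["superb film", "great fun", "loved it", "wonderful cast",
             "a real treat", "simply amazing", "thoroughly enjoyable",
             "best of the year", "a lovely surprise", "great stuff"];
[$trainingSet, $testSet] = splitData($positive);
// Train on $trainingSet, then count how often classify() agrees with the
// known label across $testSet to get an accuracy figure.
```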
This gives an accuracy of around 0.8, which isn't bad really! To demonstrate it, we can chuck a couple of example sentences in, and each comes back with the class we'd expect.
We can even use it on a longer review, as long as we split it into sentences first. I grabbed the review of Avatar from The Scientific Indian.
The output is one classification per sentence, and in this case it comes out broadly positive, which is the right direction!
There's a lot we haven't addressed in our classifier. We could pass the sentences through a couple of other classifiers first, using Bayesian techniques again, in order to determine some more useful facts. For example, is this even a review? If we just start processing blog posts, for example, we'll find a lot that mention a movie without actually saying whether it's good or bad, and we may as well discard those.
Then, for each sentence, which part of the movie is it talking about? We might then be able to correctly interpret a review which slams the actors and slates the script, but is impressed with the special effects. At each stage, the process would be the same as this time - find or create training data, train a classifier, and let it go to work.
We could also look at more complicated language models and named entity extractors, that allow us to map the odd phrases that sometimes occur, and associate opinions with the appropriate parts of a sentence. This can be a lot more work, but can also lead to higher accuracy and reliability.