Information Retrieval and other interesting topics

Linear Regression In PHP

In: classification, statistics

10 Oct 2011

I've had a couple of emails recently about the excellent Stanford Machine Learning and AI online classes, so I thought I'd put up the odd post or two on some of the techniques they cover, and what they might look like in PHP.

Xapian More Like This In PHP

In: xapian

04 Apr 2011

For my own benefit, if nothing else, since I seem to keep needing this snippet of code, I thought I'd encapsulate a Xapian More Like This/Find Similar example in a very brief blog post.

Benford's Law

In: statistics

04 Apr 2011

Benford's Law is not an exciting new John Nettles-based detective show, but an interesting observation about the distribution of the first digit in sets of numbers originating from various processes. It says, roughly, that in a big collection of data you should expect to see a number starting with 1 about 30% of the time, but starting with 9 only about 5% of the time. Precisely, the proportion for a given digit can be worked out as:
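The usual statement of the law is that a leading digit d turns up with proportion log10(1 + 1/d). As a rough illustration, here is a minimal sketch in Python (not the PHP from the full post). The function names and the choice of powers of 2 as a test data set are my own assumptions, picked only because powers of 2 are a well-known sequence that follows the law closely.

```python
import math
from collections import Counter


def benford_proportion(digit: int) -> float:
    """Expected share of numbers whose first digit is `digit` (1 to 9)."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


def observed_proportions(numbers):
    """Share of each leading digit among a collection of positive integers."""
    numbers = list(numbers)
    # str() of a positive int starts with its first significant digit.
    counts = Counter(int(str(n)[0]) for n in numbers)
    return {d: counts.get(d, 0) / len(numbers) for d in range(1, 10)}


# Illustrative data set (an assumption): the first 1000 powers of 2.
observed = observed_proportions(2 ** n for n in range(1000))

if __name__ == "__main__":
    for d in range(1, 10):
        print(f"{d}: expected {benford_proportion(d):.3f}, "
              f"observed {observed[d]:.3f}")
```

Comparing the two columns digit by digit is the basic check people run when using the law to spot data that looks manufactured.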

Monte Carlo Simulations

In: probability

07 Jul 2010

Monte Carlo simulations are a handy tool for looking at situations that have some aspect of uncertainty, by modelling them with a pseudo-random element and conducting a large number of trials. There isn’t a hard and fast Monte Carlo algorithm, but the process generally goes: start with a situation you wish to model, write a program to describe it that includes a random input, run that program many times, and look at the results.
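That general recipe can be sketched in a few lines. The example below is in Python and is only an illustration: the situation modelled (the chance that two dice sum to 7) and the names `two_dice_sum_to_seven` and `monte_carlo` are my own choices, not anything from the full post.

```python
import random


def two_dice_sum_to_seven(rng: random.Random) -> bool:
    """One trial: roll two fair dice and report whether they total 7."""
    return rng.randint(1, 6) + rng.randint(1, 6) == 7


def monte_carlo(trial, trials: int, seed=None) -> float:
    """Run `trial` many times and return the fraction of successes."""
    rng = random.Random(seed)  # seeded so a run can be repeated
    successes = sum(1 for _ in range(trials) if trial(rng))
    return successes / trials


if __name__ == "__main__":
    estimate = monte_carlo(two_dice_sum_to_seven, 100_000, seed=1)
    print(f"estimated: {estimate:.4f}, exact: {1 / 6:.4f}")
```

The dice case has a known exact answer (1/6), which makes it a handy sanity check. The same `monte_carlo` loop works unchanged for situations where no neat formula exists, as long as a single trial can be written as a function of a random source.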

Bayesian Opinion Mining

In: classification, probability

01 Jan 2010

The web is a great place for people to express their opinions on just about any subject. Even the professionally opinionated, like movie reviewers, have blogs where the public can comment and respond with what they think, and there are a number of sites that deal in nothing more than this. The ability to automatically extract people's opinions from all this raw text can be a very powerful one, and it's a well-studied area - no doubt because of the commercial possibilities.