PHP/ir

Information Retrieval and other interesting topics

PageRank In PHP

In: search

12 Dec 2009

Google was a better search engine than it's predecessors for a number of reasons, but probably the most well known one is PageRank, the algorithm for measuring the importance of a page based on what links to it. Though not necessarily that useful on its own, this kind of link analysis can be very helpful as part of a general information retrieval system, or when looking at any kind of network, such as a friend graph from a social network.

Larry Page came up with PageRank as a way of measuring the importance of a page while at Stanford in the mid nineties. The idea is that there is a random surfer, who starts on web page then browses through in a somewhat random way. For each page the surfer visits he randomly chooses one of the links from that page and follows it. From time to time though he gets bored, or hits a page with no links on it, and so jumps to some different page and starts the process again. The PageRank of a page is the proportion of his time that the surfer spends on the page.

This random walk is a Markov chain, a traversal of a matrix that gives the probability of moving to any state from any other state, with the total probabilities for the destinations from any given page adding up to 1. PageRank attempts to extract a vector of weights from that table, with one weight per page - they aren't related to any particularly query, so are only ever going to be a factor in a page being returned on a results page. However, all else being equal, a page with higher PageRank should be a better match than a page with lower PageRank.

As an aside, looking at links is also a convenient for one of the other key advances Google made - indexing pages based on the text of anchors pointing towards the page. This helps in cases where a term that the page might not use itself is searched for - for example the phrase 'big blue' for IBM is unlikely to be on their site, but will be in some links to that site. It also helps for resources which are hard to index themselves, such as videos and images, meaning they can be included in the results.

The general idea then of PageRank is pretty simply - a page hands out its PageRank to the pages it links to - but it leads to a chicken and egg situation. At the start of the process the page has no PageRank to hand out, just whatever it was initialised to. This means that the algorithm must be an iterative process, which converges on the true values.

Each page is initially given 1/the total number of pages of PageRank. At each step this is handed out to all the pages that it links to, and the base PageRank for each page replaced with the total amount it received, minus the damping factor to reflect the possibility of our random surfing typing a new URL in. We loop this until the PageRank is pretty stable, which is normally in the region of 30-50 iterations.

If we assume that we have already done our crawling and parsing steps, we will be left with an index of document IDs for different pages, and the document IDs of the pages that they link to. We don't necessarily need the whole link graph in memory, but for the example we'll assume that it does fit - this is patently not true for truly webscale items where we are likely to be talking billions or trillions of links.

<?php
$links = array( 
        1 => array(5),
        2 => array(4, 7, 8),
        3 => array(1, 3, 4, 7, 9),
        4 => array(1, 2, 4, 8),
        5 => array(1, 6, 7, 9),
        6 => array(1, 5, 8),
        8 => array(3, 4),
        9 => array(1, 4, 6, 8)
);
?>

If we look at this link graph, a few things stand out. Page 1 is pointed to by almost everything, so is presumably very good. Page 4 is also very popular, so should be expected to rank well.

We can then calculate the PageRank by running our iterative process. We've got a hard limit of 100 iterations, but we also keep track of the amount of change between one set of PageRanks and the next. If that change drops below a set level (here 0.00005) the process decides it's stable enough and stops.

<?php 
function calculatePageRank($linkGraph, $dampingFactor = 0.15) {
        $pageRank = array();
        $tempRank = array();
        $nodeCount = count($linkGraph); 

        // initialise the PR as 1/n
        foreach($linkGraph as $node => $outbound) {
                $pageRank[$node] = 1/$nodeCount;
                $tempRank[$node] = 0;
        }

        $change = 1;
        $i = 0;
        while($change > 0.00005 && $i < 100) {
                $change = 0;
                $i++;

                // distribute the PR of each page
                foreach($linkGraph as $node => $outbound) {
                        $outboundCount = count($outbound);
                        foreach($outbound as $link) { 
                                $tempRank[$link] += $pageRank[$node] / $outboundCount;
                        }
                }
                
                $total = 0;
                // calculate the new PR using the damping factor
                foreach($linkGraph as $node => $outbound) {
                        $tempRank[$node]  = ($dampingFactor / $nodeCount)
                                                + (1-$dampingFactor) * $tempRank[$node];
                        $change += abs($pageRank[$node] - $tempRank[$node]);
                        $pageRank[$node] = $tempRank[$node];
                        $tempRank[$node] = 0;
                        $total += $pageRank[$node];
                }

                // Normalise the page ranks so it's all a proportion 0-1
                foreach($pageRank as $node => $score) {
                        $pageRank[$node] /= $total;
                }
        }
        
        return $pageRank;
}
?>

The results we get are pretty much as we expect:

<?php
array(8) {
  [1]=>
  float(0.17084756816354)
  [2]=>
  float(0.060138597093041)
  [3]=>
  float(0.095270626028832)
  [4]=>
  float(0.17308244947008)
  [5]=>
  float(0.20422266588834)
  [6]=>
  float(0.086831014221959)
  [8]=>
  float(0.12476184607085)
  [9]=>
  float(0.084845233063364)
}
?>

The three main pages are 1 and 4, as we expected, but page 5 also gets a great rank because it is the only thing linked to by page 1.

We can apply this algorithm to any graph where the links between nodes have a direction from one to the other, such as determining important users on twitter by looking at @references between them, for example. Once we start calculating this data it's easy to start looking for different types of nodes, like hubs (outbound links to mostly pages with high PageRank) and authorities (lots of inbound links from pages with high PageRank), or calculating PageRank within certain topic groupings, all of which can contribute to results ranking, and a more effective search.