Google was a better search engine than it's predecessors for a number of reasons, but probably the most well known one is PageRank, the algorithm for measuring the importance of a page based on what links to it. Though not necessarily that useful on its own, this kind of link analysis can be very helpful as part of a general information retrieval system, or when looking at any kind of network, such as a friend graph from a social network.
Larry Page came up with PageRank as a way of measuring the importance of a page while at Stanford in the mid nineties. The idea is that there is a random surfer, who starts on web page then browses through in a somewhat random way. For each page the surfer visits he randomly chooses one of the links from that page and follows it. From time to time though he gets bored, or hits a page with no links on it, and so jumps to some different page and starts the process again. The PageRank of a page is the proportion of his time that the surfer spends on the page.
This random walk is a Markov chain, a traversal of a matrix that gives the probability of moving to any state from any other state, with the total probabilities for the destinations from any given page adding up to 1. PageRank attempts to extract a vector of weights from that table, with one weight per page - they aren't related to any particularly query, so are only ever going to be a factor in a page being returned on a results page. However, all else being equal, a page with higher PageRank should be a better match than a page with lower PageRank.
As an aside, looking at links is also a convenient for one of the other key advances Google made - indexing pages based on the text of anchors pointing towards the page. This helps in cases where a term that the page might not use itself is searched for - for example the phrase 'big blue' for IBM is unlikely to be on their site, but will be in some links to that site. It also helps for resources which are hard to index themselves, such as videos and images, meaning they can be included in the results.
The general idea then of PageRank is pretty simply - a page hands out its PageRank to the pages it links to - but it leads to a chicken and egg situation. At the start of the process the page has no PageRank to hand out, just whatever it was initialised to. This means that the algorithm must be an iterative process, which converges on the true values.
Each page is initially given 1/the total number of pages of PageRank. At each step this is handed out to all the pages that it links to, and the base PageRank for each page replaced with the total amount it received, minus the damping factor to reflect the possibility of our random surfing typing a new URL in. We loop this until the PageRank is pretty stable, which is normally in the region of 30-50 iterations.
If we assume that we have already done our crawling and parsing steps, we will be left with an index of document IDs for different pages, and the document IDs of the pages that they link to. We don't necessarily need the whole link graph in memory, but for the example we'll assume that it does fit - this is patently not true for truly webscale items where we are likely to be talking billions or trillions of links.
<?php
$links = array(
1 => array(5),
2 => array(4, 7, 8),
3 => array(1, 3, 4, 7, 9),
4 => array(1, 2, 4, 8),
5 => array(1, 6, 7, 9),
6 => array(1, 5, 8),
8 => array(3, 4),
9 => array(1, 4, 6, 8)
);
?>
If we look at this link graph, a few things stand out. Page 1 is pointed to by almost everything, so is presumably very good. Page 4 is also very popular, so should be expected to rank well.
We can then calculate the PageRank by running our iterative process. We've got a hard limit of 100 iterations, but we also keep track of the amount of change between one set of PageRanks and the next. If that change drops below a set level (here 0.00005) the process decides it's stable enough and stops.
<?php
function calculatePageRank($linkGraph, $dampingFactor = 0.15) {
$pageRank = array();
$tempRank = array();
$nodeCount = count($linkGraph);
// initialise the PR as 1/n
foreach($linkGraph as $node => $outbound) {
$pageRank[$node] = 1/$nodeCount;
$tempRank[$node] = 0;
}
$change = 1;
$i = 0;
while($change > 0.00005 && $i < 100) {
$change = 0;
$i++;
// distribute the PR of each page
foreach($linkGraph as $node => $outbound) {
$outboundCount = count($outbound);
foreach($outbound as $link) {
$tempRank[$link] += $pageRank[$node] / $outboundCount;
}
}
$total = 0;
// calculate the new PR using the damping factor
foreach($linkGraph as $node => $outbound) {
$tempRank[$node] = ($dampingFactor / $nodeCount)
+ (1-$dampingFactor) * $tempRank[$node];
$change += abs($pageRank[$node] - $tempRank[$node]);
$pageRank[$node] = $tempRank[$node];
$tempRank[$node] = 0;
$total += $pageRank[$node];
}
// Normalise the page ranks so it's all a proportion 0-1
foreach($pageRank as $node => $score) {
$pageRank[$node] /= $total;
}
}
return $pageRank;
}
?>
The results we get are pretty much as we expect:
<?php
array(8) {
[1]=>
float(0.17084756816354)
[2]=>
float(0.060138597093041)
[3]=>
float(0.095270626028832)
[4]=>
float(0.17308244947008)
[5]=>
float(0.20422266588834)
[6]=>
float(0.086831014221959)
[8]=>
float(0.12476184607085)
[9]=>
float(0.084845233063364)
}
?>
The three main pages are 1 and 4, as we expected, but page 5 also gets a great rank because it is the only thing linked to by page 1.
We can apply this algorithm to any graph where the links between nodes have a direction from one to the other, such as determining important users on twitter by looking at @references between them, for example. Once we start calculating this data it's easy to start looking for different types of nodes, like hubs (outbound links to mostly pages with high PageRank) and authorities (lots of inbound links from pages with high PageRank), or calculating PageRank within certain topic groupings, all of which can contribute to results ranking, and a more effective search.
Artur Ejsmont
December 16th, 2009 at 13:57
First of all, I like the article, its interesting, refreshing i would say. For change someone writes something new and it does not mention tweeter!!! :- )
"This random walk is a Markov chain, a traversal of a matrix that gives the probability of moving to any state from any other state, with the total probabilities for the destinations from any given page adding up to 1"
The only problem with this definition and page rank itself is that if you think of it its a self fulfilling prophecy. The more popular page is the more links it will get as people have better chance finding it and the higher rank it gets.
I dont have time to study the algorithm right now ... but seems like it could be quite useful. i'll bookmark it now :- )
thanks
Art
Ian Barber’s Blog: PageRank In PHP | Webs Developer
December 16th, 2009 at 16:01
... Blog: PageRank In PHP Posted by web slinger on December - 16 - 2009 Ian Barber has put together a new blog post that looks at his creation of a simple PageRank-style algorithm, similar to the ideas behind Google...
James
December 16th, 2009 at 16:17
Interesting post, i've bookmarked it. I won't lie - my brain started to hurt towards the end, but thanks.
Ian Barber’s Blog: PageRank In PHP | Development Blog With Code Updates : Developercast.com
December 16th, 2009 at 18:10
...iew all posts in Development" rel="category tag">Development Ian Barber has put together a new blog post that looks at his creation of a simple PageRank-style algorithm, similar to the ideas behind Google...
vivanno.com::aggregator » Archive » PageRank en PHP
December 17th, 2009 at 02:02
...à titre indicatif car vous ne pouvez pas vous fier à une seule source d’informations. PageRank en PHP () Pas de commentaire Pas encore de commentaire. ...
PageRank In PHP – Ian Barber | UK Hosting Directory
December 17th, 2009 at 07:06
...as a way of measuring the importance of a page while at Stanford in the mid nineties
Original post:
PageRank In PHP – Ian Barber Leave a comment ...
2009/12/18に気になったこと | debeso
December 18th, 2009 at 15:05
...ject Hosting on Google Code gearsをhtml5で扱えるAPIに対応させる簡易ライブラリ PageRank In PHP – PHP/ir PHPで連想配列から出現率の高いものをランク付けするアルゴリズム Posted i...
PageRank In PHP – PHP/ir | Coder Online
December 24th, 2009 at 13:15
...f a simple PageRank-style algorithm, similar to the ideas behind Google. …
See the rest here:
PageRank In PHP – PHP/ir Tags: barber...
Andy
March 7th, 2011 at 18:09
You forgot to handle the exception occurring when dealing with page 7 in your PHP script:
"Notice: Undefined offset: 7"
Even in your own pasted output, the rank for page 7 is missing :)
For the rest, great job! Cheers!
Andy
March 7th, 2011 at 18:21
Coming back to my previous post, if you just add to your array:
7 => array(),
then this is the result you get (without errors or notices):
Array
(
[1] => 0.15502161271358
[2] => 0.054325746363126
[3] => 0.086204130225571
[4] => 0.15688634961495
[5] => 0.18547554148966
[6] => 0.078666841474062
[7] => 0.093569119238216
[8] => 0.11300382014503
[9] => 0.076846838735805
)
Given your function does the job well, I assume this should be the correct output.
Cheers!
Andy
March 7th, 2011 at 18:35
I'm also not sure about "linking to yourself" as page having any relevance. This should be exempted from computation, but you use it.
In this trivial example:
array(
1 => array(),
2 => array(2),
3 => array(),
4 => array(4)
);
instead of getting:
Array
(
[1] => 0.25
[2] => 0.25
[3] => 0.25
[4] => 0.25
)
you get:
Array
(
[1] => 0.04025475732894
[2] => 0.45974524267106
[3] => 0.04025475732894
[4] => 0.45974524267106
)
which is incorrect.
vin
September 26th, 2011 at 15:11
Hi,
How do we modify the teleport vector to have non uniform preference for the links.
Thanks,
Vin
Wicaksana
November 17th, 2011 at 15:00
I was a student of mathematics
I want to know which part is the markov chain?
because I make this discussion as a Final Project
Thank you
bytza
November 1st, 2012 at 17:57
Realy nice, thankyou
Shrija
March 29th, 2013 at 13:16
Hi.. Great explaination of the PageRank Algorithm.. I'm fairly new to PHP so I was wondering how to run this program. Is it all to be included within one PHP program?