At the recent PHP UK Conference 2012 I had the opportunity to chat about machine learning and IR with a bunch of very smart people. One of the conversations included the always enlightening Rowan Merewood, and was around ranking Twitter friends. It's reasonably well known that Google used to use a variant of PageRank based on who-follows-who to rank it's Twitter search results (back when it had them). The question is, could the same kind of thing work over a much smaller set - say using it to rank the influence users I follow, in order, perhaps, to prioritise tweets?

In part the experiment is also an excuse to play with the PHP Lapack extension, which offers a few useful linear algebra algorithms, including calculating eigenvalues and eigenvectors. This is pertinent in this case as while PageRank can be (and usually is) calculated iteratively, mathematically it can be determined by taking the eigenvector of an appropriate matrix.

In PageRank, there is a grid that models the random walk of a user. It's a mapping between every page and every other page. For a given page (row), there is a value in each column if the row-page links to the column-page. The value of the entry is dependent on how many links there are on the page, representing the chance of a randomly clicking visitor leaving the row-page for the column-page.

In our case, we're using following as our link - on the basis that if a person we're following also follows another person we're following, we're doubly likely to see tweets by them as they may be retweeted or replied to. This is doing more than just counting up followers, as we want to factor in the importance of each follower - so if you are a followed by an important person, you are more likely to be important. Therefore, we'll work on the basis that the people who score highest when taking the PageRank are the most visible, and therefore important or influential.

So the first thing we need to do is to calculate the grid (or matrix) of all users - which is just an array of arrays. First of all we'll do a little caching function to avoid running constantly into the Twitter access limits (which are increased if you're using the authenticated API, which this isn't):

<?php
function getFromCache( $id ) {
    $cacheFile = 'cache/' . $id . ".json";
    if( !file_exists( $cacheFile ) ) {
        $access = is_numeric( $id ) ? 'user_id=' . $id : "screen_name=" . $id;
       
        // Using cURL so we can get the error code
        $ch = curl_init();
        curl_setopt ( $ch, CURLOPT_URL,
                "http://api.twitter.com/1/friends/ids.json?" . $access );
        curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
        $file = curl_exec( $ch );
       
        if( curl_getinfo( $ch, CURLINFO_HTTP_CODE ) == 401 ) {
            // account for private users
            $data = new stdClass();
            $data->ids = array();
            $file = json_encode($data);
        } else if( curl_getinfo( $ch, CURLINFO_HTTP_CODE ) == 200 ){
            $data = json_decode($file);
        } else {
            die( "Twitter rate limit hit, try again later");
        }
       
        curl_close( $ch );
        file_put_contents( $cacheFile, $file );
    } else {
        $data = json_decode( file_get_contents( $cacheFile ) );
    }
    return $data;
}
?>

We then need to actually build the matrix. This took me a couple of goes due to running out of API calls, but will vary depending on how many followers you're considering. At the end we export the matrix to a file so we can import it later without having to rerun the import:

<?php
    $screenname = "ianbarber";
    $matrix_file = $screenname . "tmatrix.php";

    // First get a list of my followers
    $following = getFromCache( $screenname );
   
    // Now build a matrix of followers
    $userlookup = array_flip( $following->ids );
    $matrix = array();
    $followercount = count( $following->ids );
    foreach( $following->ids as $key => $id ) {
        $user = array_fill( 0, $followercount, 0 );
        $their_following = getFromCache( $id );    
        $intersect = array_intersect( $following->ids, $their_following->ids );
   
        if( count( $intersect ) ) {
            $divisor = 1 / count( $intersect );
            foreach( $intersect as $shared_id ) {
                $user[$userlookup[$shared_id]] = $divisor;
            }
        }
   
        $matrix[$key] = $user;
    }
    file_put_contents( $matrix_file, '<?php $tmatrix = ' . var_export( $matrix, true ) . ";");
?>

This builds up a square matrix, which has a row and column for each user, and a value in a row/column if the row-user follows the column-user. The value is 1 divided by the number of (shared) followers - so if one of the people I follow follows lots of others, each 'vote' counts less than if someone only followed 1 other.

Actually taking the eigenvector is the easy part. The matrix is already in the right format for the Lapack extension, so we just need to call that. To be fair, actually building the Lapack exception is slightly tricky - the extension itself is just pecl install lapack-beta but getting the LAPACKE library (on which the extension depends) built is a bit harder - the best route is described in the extension README, but requires a fairly up to date version of CMake, which might be a problem on older distros. However, once it's installed using is straightforward.

The eigenValues function returns the eigenvalues and optionally eigenvectors for the supplied matrix, which are some inherent aspects of the matrix that maintain their properties during certain transformations. In this case it's the left eigenvector we're interested in.

<?php
    include $matrix_file;
    $vector = array();
    $rc = Lapack::eigenValues( $tmatrix, $vector );
    $result = array();
    foreach( $vector[0] as $key => $value ) {
        $result[$following->ids[$key]] = $value[0];
    }
    arsort( $result );

    $url = "https://api.twitter.com/1/users/lookup.json?user_id=" .
        implode(",", array_slice( array_keys( $result ), 0, 10 ) );
    $contents = json_decode( file_get_contents( $url ) );
    foreach( $contents as $i => $user ) {
        echo $i, ": ", $user->screen_name, " ", "\n";
       
    }
?>

We then just grab the principle values out of the $vector array, and rank the users by them. The results look fairly sane as well:

0: ID_AA_Carmack 
1: echofon 
2: davegardnerisme 
3: minrk 
4: wendydevolder 
5: igorwesome 
6: squirrus 
7: benbyford 
8: kachuru 
9: zoltanray 

John Carmack and echofon coming top makes sense, then you can see some representatives of the clusters of people I follow - @benbyford from work, @davegardnerisme from the grande data crowd, @wendydevolder from skillsmatter and so on. Just a bit of fun, but interesting how simple LAPACK makes calculating the values! The whole code in one handy file is on github.