If you asked most people how a search engine worked, their answer would likely be a far cry from the acres of servers and vast collections that Google queries millions of times a day. That said, the intuitive view of a search engine is in many ways just a series of incremental steps away from Mountain View.
The basic idea might be: Get a lot of documents » Accept a query » Look for the words from the query in the documents » Return all the documents that contain the words.
In fact, in a lot of cases that works - if you have a few text files then using grep will quickly bring up a list of matches. In it's most basic form, grep just searches through each document looking for exact matches to the search query - usually just a word or short phrase.
Of course, grepping doesn't scale to large collections of documents so well, and users of early information retrieval systems wanted more power than a simple string match for their queries, which lead to the development of boolean retrieval systems. These were the most popular form of IR basically up until the web came along (and are still used in some legal and medical systems). As opposed to what might be thought of as the 'natural language search' where you type in just what you're looking for, php search engine for example, boolean search specifies operators in the query: PHP AND search AND engine NOT lucene, for example. The way that these queries are handled is tied in to the way the scaling problem is handled - through the use of indexes.
Indexing and boolean queries
For our examples, we'll stick with a fairly limited set of just 3 boolean operators. A regular search query, say php information retrieval is implicitly a logical AND of the various words: find me a document that contains the word "php" AND contains the word "information" AND contains the word "retrieval".
To this, we can add the OR and NOT operations. For example we could search for php OR ruby NOT python, meaning any document that contains the word "php", but doesn't contain the word "python", plus any document that contains the word "ruby", but doesn't contain the word "python".
Rather than query by grepping each document, we'll index the documents, and then use the index to help us determine which documents match the query. To keep things simple for now, we'll assume that all the documents are simple text files - getting to that point with a variety of formats, including structured files such as HTML and binary formats like Word's .doc is a whole problem area of its own. We'll also ignore the related issue of retrieving all those documents, assuming someone has helpfully collected everything that need to be searched into one place for us.
The steps we will perform in indexing a document are:
- Generate a list of tokens from each document
- Store the address of each document against a unique document id
- Construct a dictionary of all the tokens
- Create an inverted index mapping tokens to the documents that contain them
Tokenising a document is usually a case of splitting the document into words - though there are certainly many applications that use non-word tokens extracted from the text. There are a huge variety of options around tokenising, but at it's simplest we can split on runs of characters, ignoring punctuation, and then filter out duplicates:
Repeated across all our documents, we end with a list of pairs: documentId -> term. Once we have the list we can invert it to give a list of term -> document pairs. If we sort these by the term, we can easily extract a list of all the documents which contain a give term - the posting list for that term - and we have built our first inverted index. For right now we'll treat it as if we can process the whole lot at once, and just combine these steps to build the inverted index directly:
This will work fine (and pretty quickly) for rather small collections, but as they get larger the size of the posting lists for each term will become pretty significant, enough so that we don't want to have to have the whole thing in memory for every request, even if we can afford to while indexing.
The first step is usually to split the structure so we have have two elements: a dictionary that we store the tokens in, and the inverted index which stores the postings. We can store the index on disk, and keep the dictionary in memory (perhaps using APC or Memcache), then load only the relevant parts of the index for a given query as required.
For performance reasons, we will want to be working on the smallest set of documents we can at any time, so we'll store the count of documents listed against each term, the document frequency, in the dictionary as well. Again, we can use a PHP array to represent this:
We can store the posting lists on disk in simple arrays by the tokenID - this will run into problems when we have a really large number of terms, but will do for small collections.
With our documents indexed, we can now look at processing a query.
Handling Search Queries
First, we need to tokenise the search query in the same way we tokenised the documents. In addition to that, we want some processing that will allow us to handle the boolean syntax. We can create a simple structure to represent the query*:
If we called that with a simple search, we'd end up with a tree like:
The important observation is that the query is a series of set operations on the documents that are listed against each term. For example if we had a query that consisted of term1 AND term2, then the resulting documents would be all of the documents listed against term1 that were also listed against term2 - the intersection. Term1 NOT term2 means all of the documents that are listed against term1 that are not in the list of documents against term2 - the relative complement. Term1 OR term2 simply means all of the documents against term1 plus all of the documents against term2, the union of the two sets.
All of these operations are based around the concept of a merge, where we compare the documents in one list to the documents in another. We can do these sequentially, so in the case of (term1 AND term2) NOT term3 we first calculate all the documents that are listed against both term1 and term2, then compare the resulting list and the list of documents against term3, removing any that match.
Here's a simple intersect function that demonstrates the AND case, assuming we have regular arrays** of posting that look something like:
The function just loops through one set, while incrementing the second set until it finds a match, or the current ID in the second set is bigger than that in the first.
It doesn't take much tweaking to handle the NOT case, and the OR case is just adding the lists togethe. However, rather than reinvent the wheel in this case we can take advantage of the built in PHP functions pretty easily for the whole lot:
We can pull the the above functions into a controller that calculates a list of documents based on the query we parsed before (assume the function getPostingsForTerm just looks up the term in the dictionary and retrieves the posting list from disk):
This isn't the best query execution function though! One useful optimisation is to try to look at terms with the shortest posting lists first, using the document frequency we stored in the dictionary. This means we'll be comparing the shortest list we can at each point, potentially saving quite a lot of comparisons, but meaning we have to reorder the tree. Hopefully the general idea is clear though.
Boolean search is far from ideal - it's not very intuitive, and people tend to expect the operators to mean the opposite of what they do. However, it is an important first step towards the two dominant models for search: vector based and probability based. In later posts we'll look at some more things we can do with boolean search - phrase searching, wildcard matches and a new way of representing the dictionary, but hopefully this should have covered the basics of this (simple) form of IR!
For anyone interested in the code, there is a working example on github.
- - We're ignoring some edge cases now, and we're assuming an explicit AND between each token that is ANDed (though it's not a tricky extension to fix it up to allow implicit ANDs). ** - We're eventually want to use a linked list here, as it makes inserting new tokens a bit easier if documents are updated, and has some other handy properties.