Determining whether two documents are exactly the same is pretty easy, just use some suitably sized hash and look for a match. A document will generally only hash to the same as another though if they are identical - the smallest change, or the same content on another site with a different header and footer, for example, will cause the hash to be quite different. These near duplicates are very common, and being able to detect them can be useful in a whole range of situations. Shingling is one process for relatively cheaply detecting these duplicates.
The shingle, or more precisely w-shingle, is a continuous w word long sequence, and the complete w-shingling of a document is every unique w length sequence in that document. For example, for the phrase "to be or not to be, that is the question", the 4-shingling would be:
- "to be or not"
- "be or not to"
- "or not to be"
- "not to be that"
- "to be that is"
- "be that is the"
- "that is the question"
Any duplicates would be removed. If we had the two sets of shingles for two different document we could calculate the Jaccard coefficient, which gives a similarity score between two sets. The coefficient is defined as the size of the intersection, the number of shingles present in both, divided by the size of the union, number of shingles present in either. If the two documents are identical, then every shingle will be present in both documents, and the two numbers will be the same, giving a score of 1. If some shingles are present in one but not the other they will not be counted in the intersection, so the score will be less than one, and if the two documents don't share any shingles their score will be 0.
Note that arrayintersect correctly returns only one copy of any entries which are shared between the two arrays, but we can't easily do the equivalent union operation unless the arrays are keyed, because arraymerge(array(1,2,3), array(3,4,5)) would give array(1,2,3,3,4,5) instead of the array(1,2,3,4,5) we'd want. The easiest way to work around this is to just count the ones in the intersection twice, since they must have been in both sets.
If we had this similarity score for a pair of documents we could define a threshhold, and say that any pairs of documents that are over a given score are near-duplicates. However, to check for duplicates in a collection we're still effectively comparing all the words against each other for each pair of documents, which is not likely to be cheap!
Estimating & Sampling
The solution is to estimate the resemblance by just taking a sampling of the shingles. If we hash the shingles, using a hash with a big enough number of bits, we effectively get a unique number for each one. We can that take the modulo of that number, and keep only the ones that are modulo of some chosen value. This gives us a much smaller set of shingles to work with, but distributed in such a way that similar documents are still likely to share shingles. We can then store pairs of shingle ID -> document mappings, and use almost an inverted index style structure to provide quick access to potential near-duplicates.
To keep things interesting, lets see if we can detect some near duplicates in what might be called a target rich environment, Twitter. Here's the code for a simple shingling class. Note that the modulo is quite low here, at 3, and could actually increase a lot without a loss of effectiveness, particularly for documents longer than 140 characters.
Here we assume we're looking for documents that are passed in, but we could easily add a function to check for near-duplicates inside the collection. The documents are broken down into their w-shingles, and the ones that fit the modulo stored. The numeric value for each shingle is calculated by taking the first few characters of an MD5.
The check process works by accumulating document shingle matches, so each time a document from the index has a shingle that matches with the test document, its score is increased by one. At the end, the score is calculated as a ratio of matched shingles over available shingles (in the question document), and the values compared to a threshhold. This threshold will be in the range 0-1, with the lower end returning documents that matched on a few shingles, the upper end being ones that matched on most.
We can try this on twitter data, as twitter often includes a lot of reposting of other peoples messages. These usually have RT prepended, but may also have some other edits (to fit into the limit, or to add some commentary for example). I grabbed 45 tweets from the search "phpnw09", the hashtag for the PHP Northwest conference, to check for matches. Two were removed for testing, one which had been retweeted (with some modifications) and one that hadn't.
We can run this a couple of times, just to check we get what we expect. The first test is a message that was retweeted a few times in the data: "phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t!". We get the following results:
1: phpnw08: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 2: PHPNW: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 3: phpcodemonkey: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 4: juokaz: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- Woot!! :) 5: DragonBe: RT @phpnw: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 6: jakubzalas: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 7: AnthonySterling: RT: @phpnw09 1 Week 'til #phpnw09 - let me hear you say w00t! - w00t! :-D 8: rubygem: RT @phpcodemonkey: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! 9: oatie: RT @phpcodemonkey: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! 10: DASPRiD: RT @DragonBe RT @phpnw: RT @phpnw09: 1 Week 'til #phpnw09 - let me hear you say w00t! <-- w00t! <-- w00t
A tweet that was not referenced: "xim123: @loonytoons Well I've booked train tickets for saturday now (despite thetrainline best efforts to prevent me), see you there #phpnw09" returns 0 similar documents, exactly was expected.
For better performance, shingling can be extended by shingling the shingles to get (believe it or not) supershingles. On a reasonable length document the chance of two similar documents sharing a supershingle is high, so this can save a lot of checking and give fast lookups, but might have problems on shorter documents, or when length is very variable.
Shingles can also be used for clustering. To do this, simply calculate the score between every pair of documents in the system, and keep those pairs that score over a threshold. One application from the original paper was to allow a "lost and found" service for moved web pages, by identifying where documents had moved to once they were no longer available on a given url by updating the clusters and tagging URLs with a date.