Method for detecting link spam in hyperlinked databases: Google Patent Analysis

Oh, happy day, Google have been granted a patent on web link spam. Right, Bill may say it better, and may even look over the maths, but in very rough terms, I think the patent says this:

A site may seek to improve its apparent importance by acquiring spammy links. This can sometimes be detected by examining the derivative graph of average PR (NB this is a gross oversimplification, but it’ll do) of newly acquired links, since such spammy links often come in two varieties (a rough sketch of the idea follows the list):

1) Link farm: often composed of many low-PR pages pointing to a single site, hence the graph of d(PR)/dt will be sharply negative

2) Clique attack: often composed of many high-value sites linking amongst themselves, but not outside the ring, hence the graph of d(PR)/dt will be sharply positive
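
To make that concrete, here’s a toy Python sketch of the kind of signal I think is being described. This is my own interpretation, not the patent’s actual maths: the crawl windows, PR values and function names are all made up for illustration.

```python
# Hypothetical sketch: bucket newly discovered inbound links by crawl window,
# track the average PR of each window's links, and look at the discrete
# derivative of that average over time.

from statistics import mean

def avg_pr_per_window(new_links):
    """new_links: list of windows, each a list of PR values for links first seen in that window."""
    return [mean(window) for window in new_links if window]

def pr_derivative(avg_pr):
    """Discrete d(PR)/dt between consecutive crawl windows."""
    return [b - a for a, b in zip(avg_pr, avg_pr[1:])]

# Toy data: a burst of low-PR farm links drags the average down sharply (negative slope),
# while a burst of high-PR clique links pushes it up sharply (positive slope).
farm_like   = [[4.0, 4.2], [3.9, 1.0, 0.8, 0.9], [0.7, 0.6, 0.8]]
clique_like = [[2.0, 2.1], [2.2, 6.5, 6.8], [7.0, 6.9, 7.1]]

print(pr_derivative(avg_pr_per_window(farm_like)))    # strongly negative steps
print(pr_derivative(avg_pr_per_window(clique_like)))  # strongly positive steps
```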

Neither behaviour is terribly natural, and both can be examined algorithmically. Also, although it’s not clearly stated, there must be a temporal component to this – only links acquired within a given timeframe can truly be considered coupled, else madness ensues. Personally, I’d put a damping factor into the coupling that relies on the time separation between the discovery of the various links. It’s known that Google timestamp new links when they find them, so that’s not a problem.
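
If you wanted to play with that damping idea, a toy version might look like this. The exponential decay and the two-week constant are purely my assumptions, not anything from the patent.

```python
# Sketch of a time-damping factor on link "coupling" (my suggestion above, not the
# patent's method): weight the coupling between two newly discovered links by how
# close together their discovery timestamps are, using simple exponential decay.

import math

def coupling_weight(t_a, t_b, tau_days=14.0):
    """Weight in (0, 1]; links discovered far apart contribute little to the same event.
    t_a, t_b: discovery timestamps in days; tau_days: hypothetical decay constant."""
    return math.exp(-abs(t_a - t_b) / tau_days)

# Two links found a day apart are strongly coupled; two found months apart barely count.
print(coupling_weight(100, 101))  # ~0.93
print(coupling_weight(100, 190))  # ~0.002
```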

Another critical line here: “A naturally occurring structure, in contrast, will tend to have more links to nodes outside the ring, thereby dissipating the importance of mutual reinforcement of the links.” – link out from your spam, for protection from detection +3. Fortunately this merely reinforces my own practices, really. I’ve noticed that a lot of “new wave” SEOs think that “PR bleed” is some horrific disease, and not a stupid name for the perfectly normal practice of linking to other people’s sites, like an actual webmaster might do. Repeat after me: hublike score, LocalRank, Hilltop.

Back to the patent… Interestingly, they propose taking the modulus of the normalised derivative value and comparing it to a threshold to find candidates for spam status, which may imply that the “natural” graphs of link acquisition vary very little. In clearer terms, that means you’d expect a site to pick up a mixture of high-value and low-value links, and that the derivative contributions of each of those types of links ought to produce a fairly flat graph overall. That may go some way to explaining why the Digg / SMO boosts don’t always work.
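
As a toy illustration of that threshold test – my reading of it, at least; the normalisation and the 0.5 threshold are placeholders, not values from the patent:

```python
# Sketch: normalise the step change in average PR of newly acquired links, take its
# modulus, and flag the node as a spam candidate if it exceeds a threshold. A "natural"
# mix of high- and low-value links should keep the normalised derivative small.

def is_spam_candidate(avg_pr_series, threshold=0.5):
    """avg_pr_series: average PR of newly acquired links, one value per crawl window."""
    for prev, cur in zip(avg_pr_series, avg_pr_series[1:]):
        # Normalise the change by the previous level, then take the modulus.
        norm_deriv = (cur - prev) / prev if prev else 0.0
        if abs(norm_deriv) > threshold:
            return True
    return False

print(is_spam_candidate([4.1, 1.65, 0.7]))  # True  -- sharp drop, farm-like pattern
print(is_spam_candidate([2.0, 2.1, 2.2]))   # False -- flat, "natural"-looking acquisition
```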

Note that the patent was submitted in 2004 – it’s likely they’ve been using this for a while, and I suspect we’ve all seen the effects. It strikes me most immediately as a huge helping hand to negative SEO – thanks for spelling out more clearly what I need to do to make someone look like a spammer, Sepandar D. Kamvar, Taher H. Haveliwala, and Glen M. Jeh. I owe you all a beer.

Footnote:

This analysis was sent to me by my good friend Brendon Scott (aka TallTroll), who has been involved in SEO for a lifetime. If you can find him, maybe you could hire him – a bit like the A-Team, really.
