Are you a web spammer? No, seriously, I mean it. If there is one area that a lot of search peeps and marketers aren’t always clear on, it’s penalties and filters from search engines. The confusion is very common in SEO circles. We need look no further than duplicate content: while it is (generally) a filter, there is no shortage of people who call it a “duplicate content penalty.”
As such, I thought it would be a good idea to look at the many faces of web spam, from the search engineer perspective. This isn’t about teaching you how to be a better spammer — quite the opposite actually, as I am not a fan of that crap. Sure, I have a few mates that play in the black hat world, but they are well aware I am not a fan of it, or polluting the web in general.
This journey is hopefully about helping you avoid tactics, or groups of activities, that might put your client’s or your own websites at risk.
What is web spam? In the research for this post this seemed to be the best, or at least most concise, definition I came across:
any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page’s true value. (from Web Spam Taxonomy, Stanford)
Hmmmm. Or is it? If this were the case we’d ALL be spammers, since what we do as SEOs is attempt to stack the deck somewhat. Dammit. Oh well. Of further interest, that Stanford paper goes on to say:
An important voice in the web spam area is that of search engine optimizers (SEOs), such as SEO Inc. (www.seoinc.com) or Bruce Clay (www.bruceclay.com).
Ouch. Not nice at all — how about:
Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call “ethical” web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming. (emphasis mine)
Holy shizzle — it reminds us that SEOs aren’t criminals, but they are certainly the enemy. Let us diverge somewhat and consider that spamming is the blatant manipulation that adds no value and seeks to leverage the algorithmic blindness of a search algorithm, ok? Lol — leave it at that. And never forget, they don’t like us (SEOs).
There are essentially two types of spamming: boosting and hiding.
Boosting: This is when one takes an action intended to (falsely?) increase or boost the value of a page.
Hiding: This set of techniques uses methods that are not generally noticeable to get a page to rank higher or, more appropriately, to hide the boosting techniques. These are certainly more problematic, and search engines tend to treat them as more insidious than the boosting ones.
Language: In some testing search engineers looked at the actual languages of pages to see what they might find. Of note, French was most commonly found to be a spam fest, with German and English coming in after that. I found that pattern to be interesting.
Domain: I am sure it comes as no surprise that .BIZ domains have been found to have much higher spam rates than any other. This was followed by .US and .COM domains. But the .BIZ domains were head and shoulders above the others, so stay away from them, ok?
Words per page: Another approach that is often used. What they found was that the pages with more text were often the ones containing more spam. This curve did lessen once pages went over 1500 words; 750 to 1500 seemed to be the spammers’ sweet spot.
Keywords in page TITLE: This is another area they will look at as testing has shown spam pages tend to use far more KWs in the TITLE element than non-spam pages.
Amount of anchor text: Another interesting approach involves looking at the ratio of text to anchor text on a page. This can be on a page or site level. Websites with a high percentage of anchor text (to standard text) are more likely to be spam sites.
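To make the anchor text ratio a bit more concrete, here is a minimal page-level sketch using Python's standard `html.parser`. The class and function names are my own for illustration, and this is not how any particular search engine computes it. It also counts every text node (including script and style content), which a real system would filter out.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts words inside <a> elements versus all words on the page."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # greater than 0 while inside an <a> element
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        self.total_words += words
        if self.anchor_depth > 0:
            self.anchor_words += words


def anchor_text_fraction(html):
    """Return the share of words that sit inside links (0.0 for empty pages)."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words


page = '<p>one two three <a href="/x">four</a></p>'
print(anchor_text_fraction(page))  # 0.25
```

A page that is mostly links would score close to 1.0. Where the cut-off sits is something only the engines know.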
Fraction of visible content: This one pertains to attempts at using hidden text, not to be confused with code to text ratios. They are looking at the percentage of text that is not actually being rendered on the page.
Compressibility: As a mechanism used to fight KW stuffing, search engines can also look at compression ratios. More specifically, this catches repetitious or spun content. Search engines often compress a page to save on indexing and processing costs. The compression ratio (uncompressed size divided by compressed size) of likely spam pages tends to be higher than normal, because repeated text compresses unusually well.
Globally popular words: Another good way to find KW stuffing is to compare the words on the page to existing query data and known documents. Essentially, if someone is KW stuffing around given terms, those terms will show up in more unnatural usage than they do in user queries and known good pages.
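The compression ratio is easy to try for yourself. Here is a small sketch using Python's built-in `zlib`. The function name and the sample strings are made up for illustration, and the engines may well use a different compressor.

```python
import zlib


def compression_ratio(text):
    """Uncompressed size divided by compressed size (0.0 for empty text)."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical examples: a keyword-stuffed page versus ordinary prose.
stuffed = "cheap widgets buy cheap widgets online " * 200
natural = (
    "Our workshop has repaired bicycles since 1994. Bring your frame in "
    "on a weekday morning and we will usually have it back to you by the "
    "following afternoon, brakes adjusted and gears indexed."
)

print(compression_ratio(stuffed))
print(compression_ratio(natural))
```

The stuffed text comes out with a far higher ratio than the natural paragraph, which is exactly the kind of gap this signal looks for.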
Query spam: Given the rise of query analysis, click data and personalization, spammers might seek to query various target terms and click on their own results. By looking at the pattern of the queries, in combination with other signals, these tactics would become statistically apparent.
Host-level spam: This looks at other sites and domains at the server and/or registrar level. Much like TrustRank, spammers will often be found in the same neighborhoods as other spammers.
Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.
TrustRank: This method has more than a few names, TrustRank being the Yahoo flavor. The concept revolves around having “good neighbors.” Research shows that good sites link to good ones and vice versa. You are known by the company you keep.
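For the curious, the "good neighbors" idea can be sketched as a PageRank-style calculation where trust only enters the graph through a hand-picked seed set of known good pages. This is a simplified illustration under my own assumptions (the function name, damping value, iteration count and the tiny graph are all hypothetical), not any engine's actual implementation.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of pages reviewed by hand and considered trustworthy.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # Only seed pages receive trust "from outside" the graph.
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)

    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_score[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Each page splits its damped trust evenly among its outlinks.
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Hypothetical mini web: the spam pages link to each other and to a good
# page, but no good page links back to them.
graph = {
    "good1": ["good2"],
    "good2": ["good1", "news"],
    "news": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
scores = trust_rank(graph, seeds={"good1"})
```

In this toy graph the spam pages end up with zero trust because nothing trusted ever links to them, while "news" inherits some trust through "good2".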
Link stuffing: This is more of an on-site approach, where a spammer creates a ton of low-value pages and points all the links (even on-site) to the target page. Spam sites tend to show a higher ratio of these unnatural patterns compared to a training set of known good pages.
Nepotistic links: Here we would have everything from paid links to traded ones (reciprocal). While this may be a hazy area for SEOs, search engines most certainly consider links exchanged in any reciprocal form to be overt manipulation.
Topological spamming (link farms): While we have our own vernacular on this one, search engines will look at the percentage of links in the graph compared to known “good” sites. Typically those looking to manipulate the engines will have a higher percentage of links from these locales.
Temporal anomalies: Another area where spam sites generally stand out from other pages in the corpus is the historical data. “Normal” sites in the index show an average rate of link acquisition and decay. Temporal data can be used to help detect spammy sites with unnatural link building habits.
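One simple way to picture a temporal check is to compare a site's monthly new-link counts against a baseline taken from "normal" sites and flag the months that sit far above it. The sketch below uses Python's `statistics` module; the function name, the threshold of three standard deviations and all the numbers are assumptions for illustration only.

```python
from statistics import mean, stdev


def link_spikes(monthly_new_links, baseline, threshold=3.0):
    """Return indexes of months far above the baseline link acquisition rate.

    baseline needs at least two values so a standard deviation exists.
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything above it counts as a spike.
        return [i for i, n in enumerate(monthly_new_links) if n > mu]
    return [
        i for i, n in enumerate(monthly_new_links)
        if (n - mu) / sigma > threshold
    ]


# Hypothetical data: new links per month for "normal" sites and one suspect.
normal_months = [10, 12, 9, 11, 13, 10, 12, 11]
suspect_site = [11, 12, 10, 250, 13]
print(link_spikes(suspect_site, normal_months))  # [3]
```

A single spike can have an innocent explanation (a story going viral, for example), so this would only ever be one signal among several.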
What’s the point of it all? To me this trail was interesting on a few levels. Let’s have a look:
As always, it never hurts to understand search engines better if you’re going to be optimizing for them. Heck, maybe if we, as a group, begin to understand search engineers and their challenges better, they might speak well of us some day. Naw, that’s just a silly dream.
One thing that is always important to remember is that in most cases no one signal nor approach is considered definitive. Search engines often employ a variety of methods to find spam. This, for those of us playing nice, means there is less of a chance of a false positive.
Getting your clients or yourself into hot water would generally mean tripping more than one of these signals. That being said, most of the folks in the search community aren’t big fans of SEO and there are those that feel even the minor “manipulations” should be punishable. From what I know, we need not get too worried about a lynching just yet. Ultimately there are levels and thresholds and as long as you stay clear of tripping too many wires, things should be ok.
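The "levels and thresholds" idea can be pictured as a simple tally: each signal has its own limit, and a page only gets flagged once it trips several of them. This is a toy sketch. The signal names, the limits and the "two or more" rule are all invented for illustration, and real systems almost certainly weigh signals in far more sophisticated ways.

```python
# All limits below are made-up illustrations, not values used by any engine.
THRESHOLDS = {
    "compression_ratio": 4.0,
    "anchor_text_fraction": 0.5,
    "title_keyword_count": 10,
}


def tripped_signals(page_signals, thresholds=THRESHOLDS):
    """Return the sorted names of signals whose value exceeds its limit."""
    return sorted(
        name for name, limit in thresholds.items()
        if page_signals.get(name, 0) > limit
    )


def looks_spammy(page_signals, min_tripped=2, thresholds=THRESHOLDS):
    """Flag a page only when it trips at least min_tripped signals."""
    return len(tripped_signals(page_signals, thresholds)) >= min_tripped


one_wire = {"compression_ratio": 6.0, "anchor_text_fraction": 0.1}
two_wires = {"compression_ratio": 6.0, "anchor_text_fraction": 0.8}
print(looks_spammy(one_wire))   # False
print(looks_spammy(two_wires))  # True
```

Tripping a single wire leaves the page alone here, which is the false-positive protection described above.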
One thing is for sure, you will be a much better SEO if you get a better grounding in information retrieval. This post touches on some common aspects — there’s a TON more for those that are interested.
I hope you enjoyed the journey … play safe!
Before I go, here’s a ton of research and reading that you can get into if you want to learn more — my goal is always to motivate peeps into learning more. No single blog post can do any IR (information retrieval) topic justice. The following are some of the items I looked at when putting this together.
WebSpam: Dr. Marc Najork – Microsoft Research
Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.
Cloaking and redirection spam
Now, if that wasn’t everything you ever wanted to know about web spam, then I don’t know what is!! :0)
David Harry is an SEO and search analyst with Reliable SEO. He also runs the SEO Training Dojo, a top community in the SEO space. You can also track him down via Twitter: @theGypsy.