Web Spam: The Definitive Guide


Understanding the Boundaries and How to Play Safe

Are you a web spammer? No, seriously, I mean it. If there is one area that a lot of search peeps and marketers aren't always clear on, it's penalties and filters from search engines. This is something you will find very common in SEO circles. We need to look no further than something like duplicate content. While it is (generally) a filter, there is no shortage of people that call it a "duplicate content penalty."

As such, I thought it would be a good idea to look at the many faces of web spam, from the search engineer perspective. This isn't about teaching you how to be a better spammer -- quite the opposite actually, as I am not a fan of that crap. Sure, I have a few mates that play in the black hat world, but they are well aware I am not a fan of it, or polluting the web in general.

This journey is hopefully about helping you avoid tactics, or groups of activities, that might put your clients' or your own websites at risk.

Are SEOs spammers?

Defining Web Spam

What is web spam? In the research for this post this seemed to be the best, or at least most concise, definition I came across:

any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value. (from Web Spam Taxonomy, Stanford)

Hmmmm. Or is it? If this were the case we'd ALL be spammers, since what we do as SEOs is attempt to stack the deck somewhat. Dammit. Oh well. Of further interest, that Stanford paper goes on to say,

An important voice in the web spam area is that of search engine optimizers (SEOs), such as SEO Inc. (www.seoinc.com) or Bruce Clay (www.bruceclay.com).

Ouch. Not nice at all -- how about:

Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming. (emphasis mine)

Holy shizzle -- it reminds us that SEOs aren't criminals, but they are certainly the enemy. Let us diverge somewhat and consider that spamming is the blatant manipulation that adds no value and seeks to leverage the algorithmic blindness of a search algorithm, ok? Lol -- leave it at that. And never forget, they don't like us (SEOs).

Types of Web Spam

There are essentially two types of spamming: boosting and hiding.


Boosting Techniques

This is when one takes an action intended to (falsely?) increase or boost the value of a page.

  • Term Spamming: This would be those seeking to manipulate through elements such as the page TITLE (title spam), Meta Description or Meta Keywords (meta spam). As most of us know, two out of three of those were abused to the point where most modern search engines don't use them as signals at all.
  • URL Spamming is another area they've been known to also look at. Yup, strange as it sounds, because there is some weight given to URLs by some search engines, it can be considered to be a manipulation.
  • Link Spamming is another well-known one that also includes anchor text spamming. Search engines consider not only the mass of link spam, but also the anchor text, as this is one of the more important signals from a ranking perspective. This section obviously also includes when spammers seek to drop links on pages to increase a target page's value (forums, comments, guest books, etc.) and obviously the more nefarious hack and drop techniques.

Hiding Techniques

This set of techniques is about using methods that aren't generally noticeable to get a page to rank higher. Or, more appropriately, it is the hiding of boosting techniques. These are certainly more problematic, and search engines tend to treat them as more insidious than the boosting ones.

  • Content hiding: These are techniques where terms and links are hidden when the browser renders a page. The more common approaches are using color schemes that render the elements in question effectively invisible.
  • Cloaking: We all know this one, right? This is when one identifies a search engine crawler and seeks to show a different version of the page to the spider than it would to the average user. This, one assumes, cuts down on the chances of being reported by users or competitors that might otherwise see the spammy page.
  • Redirection: The page is automatically redirected by the browser, so that the page gets indexed by the engine but the user never actually sees it. It essentially acts as a proxy/doorway to game the engine and misdirect users.
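
If you want to sanity-check your own pages for accidental cloaking, the idea above can be sketched in a few lines of Python: take the text served to a crawler user agent and the text served to a normal browser, and see how much the vocabulary overlaps. This is only a rough sketch. It assumes you have already fetched both versions as strings, and the function names and the 0.8 threshold are my own hypothetical choices, not anything a search engine has published.

```python
import re

def word_set(text):
    """Lowercased set of word tokens found in a block of text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(text_a, text_b):
    """Jaccard similarity of the two vocabularies (1.0 means identical word sets)."""
    a, b = word_set(text_a), word_set(text_b)
    if not a and not b:
        return 1.0  # two empty pages are trivially the same
    return len(a & b) / len(a | b)

def looks_cloaked(crawler_text, browser_text, threshold=0.8):
    """Flag the pair when the two versions share too little vocabulary.

    The threshold is an illustrative assumption, not a known cutoff.
    """
    return jaccard(crawler_text, browser_text) < threshold
```

A low score does not prove cloaking (personalization and ads also change pages), so treat it as a prompt to look closer.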

Ways of Detecting Spam

Approaches to Combating Web Spam

Content Spam

Language: In some testing search engineers looked at the actual languages of pages to see what they might find. Of note, French was most commonly found to be a spam fest, with German and English coming in after that. I found that pattern to be interesting.

Domain: I am sure it comes as no surprise that .BIZ domains have been found to have much higher spam rates than any other. This was followed by .US and .COM domains. But the .BIZ were head and shoulders above the others -- stay away from them, ok?

Words per page: Another approach that is often used. What they found was that the pages with more text were often the ones containing more spam. This curve did lessen once over 1500 words. From 750-1500 seemed to be the spammers' sweet spot.

Keywords in page TITLE: This is another area they will look at as testing has shown spam pages tend to use far more KWs in the TITLE element than non-spam pages.

Amount of anchor text: Another interesting approach involves looking at the ratio of text to anchor text on a page. This can be on a page or site level. Websites with a high percentage of anchor text (to standard text) are more likely to be spam sites.
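
To make the anchor text ratio concrete, here is a minimal sketch in Python using the standard library's html.parser. It counts words inside links against all words on the page. The class and function names are hypothetical, and a real system would also need to skip script and style content, which this sketch does not do.

```python
from html.parser import HTMLParser

class AnchorTextCounter(HTMLParser):
    """Counts words inside <a> elements versus all words in the document."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while we are inside an <a> element
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.total_words += count
        if self.anchor_depth:
            self.anchor_words += count

def anchor_text_ratio(html):
    """Fraction of the page's words that sit inside links (0.0 for an empty page)."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words
```

What counts as "too high" is not stated in the research I read, so any cutoff you pick is your own call.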

Fraction of visible content: This one pertains to attempts at using hidden text, not to be confused with code to text ratios. They are looking at the percentage of text that is not actually being rendered on the page.

Compressibility: As a mechanism used to fight KW stuffing, search engines can also look at compression ratios, which expose repetition or content spinning. Search engines often compress a page to save on indexing and processing costs. The compression ratio (uncompressed size divided by compressed size) tends to be higher for likely spam pages, since repeated text compresses unusually well.
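
The ratio itself is easy to compute. Below is a minimal sketch using Python's zlib module. The function name is hypothetical, and I am assuming zlib as a stand-in, since the compressor a search engine actually uses is not something the sources spell out.

```python
import zlib

def compression_ratio(text):
    """Uncompressed size divided by compressed size, in bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress
    return len(raw) / len(zlib.compress(raw))

# A stuffed page repeats the same phrase over and over...
stuffed = "cheap widgets buy cheap widgets now " * 200
# ...while ordinary prose has far less repetition.
varied = ("The quick brown fox jumps over the lazy dog while search "
          "engineers compare many different signals.")

print(compression_ratio(stuffed) > compression_ratio(varied))
```

Very short texts can score below 1.0 because of compression overhead, so this only means something on pages with a reasonable amount of text.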

Globally popular words: Another good way to find KW stuffing is to compare the words on the page to existing query data and known documents. Essentially if someone is KW stuffing around given terms, they will be in a more unnatural usage than user queries and known good pages.
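
One simple way to picture this comparison is to measure how much of a page is made up of words from a list of globally popular terms. The Python sketch below does just that. The function name and the tiny word list are hypothetical, and a real list would come from query logs or a large corpus. Natural writing tends to land in a fairly stable band, so a page sitting far above or below that band looks unnatural.

```python
import re

def popular_word_fraction(page_text, popular_words):
    """Fraction of the page's words that appear in the popular-word list."""
    words = re.findall(r"[a-z']+", page_text.lower())
    if not words:
        return 0.0
    popular = set(popular_words)
    hits = sum(1 for word in words if word in popular)
    return hits / len(words)

# Toy stand-in for a real popularity list (an assumption for illustration).
POPULAR = {"the", "a", "of", "and", "to"}
```

With that toy list, a page reading "cheap pills cheap pills cheap pills" has no popular words at all, which is exactly the sort of oddity this signal is meant to surface.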

Query spam: Given the rise of query analysis, click data and personalization, spammers might seek to query various target terms and click on their own results. By looking at the pattern of the queries, in combination with other signals, these tactics would become statistically apparent.

Host-level spam: This looks at other sites and domains at the server and/or registrar level. Much like TrustRank, spammers will often be found in the same neighborhoods as other spammers.

Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.

Link Spam

TrustRank: This method has more than a few names, TrustRank being the Yahoo flavor. The concept revolves around having "good neighbors." Research shows that good sites link to good ones and vice versa. You are known by the company you keep.
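
The "good neighbors" idea can be sketched as a PageRank-style calculation that only hands out trust from a hand-picked seed set of known good sites. The Python below is a simplified illustration and not the published algorithm in full: the function name, the damping value and the iteration count are my assumptions, and pages without outbound links simply let their trust leak away.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed sites along outbound links.

    links: dict mapping each site to a list of sites it links to.
    seeds: non-empty set of sites assumed to be trustworthy.
    """
    nodes = set(links) | {dst for targets in links.values() for dst in targets}
    seed_score = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    trust = dict(seed_score)
    for _ in range(iterations):
        # Every round, seeds are topped up; everyone else starts from zero.
        nxt = {n: (1 - damping) * seed_score[n] for n in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if not targets:
                continue
            share = damping * trust[src] / len(targets)
            for dst in targets:
                nxt[dst] += share
        trust = nxt
    return trust

web = {
    "good": ["friend"],
    "friend": ["good"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = trust_rank(web, seeds={"good"})
```

In this toy graph the spam pair only link to each other, so no trust ever reaches them, while "friend" picks up trust from the seed.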

Link stuffing: This would be more of an on-site approach where a spammer would create a ton of low-value pages and point all the links (even on-site) to the target page. Spam sites tend to have a higher ratio of these types of un-natural appearances (to a training set of known good pages).

Nepotistic links: Here we would have everything from paid links to traded ones (reciprocal). While this may be a hazy area for SEOs, search engines most certainly believe link manipulation in any reciprocal form to be overt manipulation.
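
Reciprocal links are the easiest of these to spot in a link graph. Here is a minimal Python sketch that lists every pair of sites linking to each other. The function name and the sample domains are hypothetical, and a real check would weigh how large a share of a site's links is reciprocal rather than just listing pairs.

```python
def reciprocal_pairs(links):
    """Return the set of site pairs that link to each other.

    links: dict mapping each site to a list of sites it links to.
    """
    edges = {
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src != dst  # ignore self-links
    }
    return {tuple(sorted(edge)) for edge in edges if (edge[1], edge[0]) in edges}

graph = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["a.com"],
    "c.com": [],
}
pairs = reciprocal_pairs(graph)
```

Here only a.com and b.com trade links, so they are the one pair that comes back.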

Topological spamming (link farms): While we have our own vernacular on this one, search engines will look at the percentage of links in the graph compared to known "good" sites. Typically those looking to manipulate the engines will have a higher percentage of links from these locales.

Temporal anomalies: Another area where spam sites generally stand out from other pages in the corpus is the historical data. "Normal" sites in the index show an average rate of link acquisition and decay. Temporal data can be used to help detect spammy sites whose un-natural link building habits stray from that average.
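
As a rough illustration of a temporal check, the Python sketch below flags months where a site gained far more links than usual. Note the assumption: it compares a site against its own history using a simple z-score, whereas the research compares against "normal" sites across the index. The function name and the 2.0 threshold are hypothetical.

```python
from statistics import mean, pstdev

def link_spikes(monthly_new_links, z_threshold=2.0):
    """Return the indexes of months whose new-link count is unusually high."""
    if not monthly_new_links:
        return []
    mu = mean(monthly_new_links)
    sigma = pstdev(monthly_new_links)
    if sigma == 0:
        return []  # perfectly steady growth, nothing stands out
    return [
        i for i, count in enumerate(monthly_new_links)
        if (count - mu) / sigma > z_threshold
    ]

history = [10, 12, 9, 11, 10, 200, 11, 10]
spikes = link_spikes(history)
```

The month with 200 new links is the one that gets flagged here. A legitimate viral hit would trip this too, which is one more reason no single signal is treated as proof.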

Lessons for SEOs

What's the point of it all? To me this trail was interesting on a few levels. Let's have a look:

  • Ranking Signals: If we reverse-engineer their reverse engineering of us, we can start to actually see what signals are important and which they wish to protect. Understanding what they're protecting tells us what they consider important. Right?
  • Signal Funnel: Considering the amount of effort put into link spam, we do know that modern link-centric search engines have an interest in less diversified ranking approaches. That is to say, if you NEED links to rank, they know where to look for the spammers. Dealing with web spam is heavily tied to the future of search. Watch and learn.
  • You are the bad guys: As discussed, we're not on most search engineers' Xmas card lists. Know this and understand it. They tolerate us -- even the most well-meaning "white hat" among us.
  • Dampening is more common: Another thing I learned is that more often than not, especially with borderline link spam, the juice would be turned off, not the site de-indexed. Is that a penalty or a filter? Does it matter?
  • Authority/Trust: We would be wise to watch where we play. Building authority and becoming associated with other known entities is at a premium.

As always, it never hurts to understand search engines better if you're going to be optimizing for them. Heck, maybe if we, as a group, begin to understand search engineers and their challenges better, they might speak well of us some day. Naw, that's just a silly dream.

Combinations Create the Spam Signals

One thing that is always important to remember is that in most cases no one signal or approach is considered definitive. Search engines often employ a variety of methods to find spam. This, for those of us playing nice, means there is less of a chance of a false positive.

To get your clients or yourself into hot water generally would mean that you would be satisfying more than one element. That being said, most of the folks in the search community aren't big fans of SEO and there are those that feel even the minor "manipulations" should be punishable. From what I know, we need not get too worried about a lynching just yet. Ultimately there are levels and thresholds and as long as you stay clear of tripping too many wires, things should be ok.
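
The "tripping too many wires" idea can be sketched as a simple count of how many signals exceed their thresholds. The Python below is purely illustrative: the signal names, the threshold values and the rule of needing two tripped signals are all my own assumptions, and real engines almost certainly weight and combine signals in far more sophisticated ways.

```python
def tripped_signals(signals, thresholds):
    """Names of the signals whose measured value exceeds its threshold."""
    return [
        name for name, value in signals.items()
        if name in thresholds and value > thresholds[name]
    ]

def at_risk(signals, thresholds, min_tripped=2):
    """A page is only flagged when several signals trip at once."""
    return len(tripped_signals(signals, thresholds)) >= min_tripped

# Hypothetical thresholds, for illustration only.
THRESHOLDS = {"anchor_text_ratio": 0.5, "compression_ratio": 4.0, "title_keywords": 8}

one_wire = {"anchor_text_ratio": 0.7, "compression_ratio": 2.1, "title_keywords": 3}
many_wires = {"anchor_text_ratio": 0.7, "compression_ratio": 9.5, "title_keywords": 12}
```

With these numbers the first page trips a single wire and is left alone, while the second trips all three and gets flagged.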

One thing is for sure, you will be a much better SEO if you get a better grounding in information retrieval. This post touches on some common aspects -- there's a TON more for those that are interested.

I hope you enjoyed the journey ... play safe!

Patents, Papers and Videos

Before I go, here's a ton of research and reading that you can get into if you want to learn more -- my goal is always to motivate peeps into learning more. No single blog post can do any IR (information retrieval) topic justice. The following are some of the items I looked at when putting this together.

Web Spam Research Papers

TrustRank Concepts

Link Spam

Implicit/Explicit signals


Social Spam

Language/Semantic related


WebSpam: Dr. Marc Najork - Microsoft Research

Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.

More Videos:


Trust-related signals

Query Spam

Link Spam

Cloaking and redirection spam


Now, if that wasn't everything you ever wanted to know about web spam, then I don't know what is!! :0)

David Harry is an SEO and search analyst with Reliable SEO. He also runs the SEO Training Dojo, a top community in the SEO space. You can also track him down via Twitter: @theGypsy.


Apr 28, 2010

Wow great job putting this together! It's like what not to do!

Apr 28, 2010

This article rocks! So much to take in, but definitely some excellent snippets of information in there I can take onboard straight away to make sure I don't fall foul of Cutts & Co.

Apr 28, 2010

@Evan - that was the motivation (beyond I just love reading the stuff); to create awareness. The last thing we need is to stumble into a bad situation via ignorance.

@Ubisan - and U rock too!! lol... I like to think of it as getting into the mindset of a search engineer. The more one thinks like a search engine, the better decisions they'll make in SEO (or so the theory goes). Understanding how to avoid a spammy profile, is one piece of that puzzle. Thanks for takin' the time to comment! I'm glad it got ya thinking!

Apr 28, 2010

Dave- awesome piece. Good thing you had plenty of text otherwise all those anchor text links would have tripped the spam filter. :-)

Adam Humphreys
Apr 28, 2010

Well done David!

One thing I've been wondering is how long it will be before the voice recognition software is used on the captchas that use audio.

It seems like these days there's a fine line between duplicate content on various sites, and aggregated content via rss feeds etc. Matt Cutts of the Google Web Spam team has recently talked about using the canonical tags in feeds to ensure sites don't lose credit for their work. I'd definitely like to see an article where this is actually applied.

Apr 30, 2010

Adam & Arnie, thanks for dropping in and thanks for the comments. It was certainly an interesting post to write/research. I really don't like spammy crap and polluting the web. Sure, I know SEOs that do it, but I still am not a fan. It is one helluva undertaking for SEs to combat the crap, and some of the solutions are interesting. As for duplicate content, it is more an issue of ending up with over-indexation and fractured link stability than any form of penalty really (unless your site was entirely made up of scraped content of course).

May 03, 2010

Great article and well said.

Craig Broadbent
May 04, 2010

excellent article, thanks for sharing!

May 25, 2010

I disagree with this fundamental underlying assumption that something is spam, just because it replicates something else that is on the web.

My neighborhood is full of Chinese restaurants. But why shouldn't someone be entitled to open another Chinese restaurant, call it a Chinese restaurant, drop a menu on my doorstep, and attempt to compete for their share of the neighborhood's Chinese food consumption?

In all fairness, there's A LOT of garbage on the web, just like there are some lousy Chinese restaurants advertising gourmet fare. But let's not throw out the baby with the bathwater--there's good reason for lots of competition in lucrative markets. Ultimately, the market will allow the best to rise to the top. Meanwhile, the aggressive competition forces them to continuously improve their product and value proposition.

another spammer
Jun 18, 2010

Awesome list, thank you for great patent.

Dan Gayle
Jul 20, 2010

Great aggregation of resources! The way I think of white hat SEO is that I am helping remove barriers to Google fully understanding my content. For instance, instead of silly or clever titles, headlines, etc., I encourage writers to only put what the actual article is about. I think the Twitter culture has helped in this area, because it has created a cultural atmosphere of getting straight to the point in your writing.

Mar 09, 2017

Excellent post!!!!

May 27, 2017

Ann, I may have professional feelings for you.

I'll add:

Build links that people want to click. Track referrals from those links. Track a few hot keyword rankings from referring URLs and apply Slingshot SEO's click-through-rate study numbers. Feel good about all that. Tell a friend. Sleep.

May 27, 2017

That's a great point, and I see the same thing with social. Much of the value of guest-blogging is relationship building, IMO. Almost all of my business has come from online relationship-building via guest-blogging, blog/forum participation, and social media. Even if I never got a link from any of those sources, I'd still be making money.

What's ironic is that everything in that list can be (and is) used in a spammy, low-quality way. If you paste random, link-filled comments to 100s of irrelevant blogs, it's spam. If you thoughtfully engage on sites regularly and connect to your audience, it's good marketing.
