Web Spam: The Definitive Guide

April 28, 2010
SEO Marketing

Understanding the Boundaries and How to Play Safe

Are you a web spammer? No, seriously, I mean it. If there is one area that a lot of search peeps and marketers aren't always clear on, it's penalties and filters from search engines. This is something you will find very common in SEO circles. We need to look no further than something like duplicate content. While it is (generally) a filter, there is no shortage of people that call it a "duplicate content penalty."

As such, I thought it would be a good idea to look at the many faces of web spam, from the search engineer perspective. This isn't about teaching you how to be a better spammer -- quite the opposite actually, as I am not a fan of that crap. Sure, I have a few mates that play in the black hat world, but they are well aware I am not a fan of it, or polluting the web in general.

This journey is hopefully about helping you avoid tactics, or groups of activities that might put your client or your own websites at risk.

Are SEOs spammers?

Defining Web Spam

What is web spam? In the research for this post this seemed to be the best, or at least most concise, definition I came across:

any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value. (from Web Spam Taxonomy, Stanford)

Hmmmm. Or is it? If this were the case we'd ALL be spammers, since what we do as SEOs is attempt to stack the deck somewhat. Dammit. Oh well. Of further interest, that Stanford paper goes on to say,

An important voice in the web spam area is that of search engine optimizers (SEOs), such as SEO Inc. (www.seoinc.com) or Bruce Clay (www.bruceclay.com).

Ouch. Not nice at all -- how about:

Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming. (emphasis mine)

Holy shizzle -- it reminds us that SEOs aren't criminals, but they are certainly the enemy. Let us diverge somewhat and consider that spamming is the blatant manipulation that adds no value and seeks to leverage the algorithmic blindness of a search algorithm, ok? Lol -- leave it at that. And never forget, they don't like us (SEOs).

Types of Web Spam

There are essentially two types of spamming: boosting and hiding.

Boosting

This is when one takes an action intended to (falsely?) increase or boost the value of a page.

  • Term Spamming: This would be those seeking to manipulate through elements such as the page TITLE (title spam), Meta Description or Meta Keywords (meta spam). As most of us know, two out of three of those were abused to the point where most modern search engines don't use them as signals at all.
  • URL Spamming is another area they've been known to also look at. Yup, strange as it sounds, because there is some weight given to URLs by some search engines, it can be considered to be a manipulation.
  • Link Spamming is another well-known one that also includes anchor text spamming. Search engines consider not only the mass of link spam, but also the anchor text, as this is one of the more important signals from a ranking perspective. This section obviously also includes spammers dropping links on pages to increase a target page's value (forums, comments, guest books, etc.) and, of course, the more nefarious hack and drop techniques.

Hiding Techniques

This set of techniques is about getting a page to rank higher through methods that generally go unnoticed. Or, more appropriately, the hiding of boosting techniques. These are certainly more problematic, and search engines tend to treat them as more insidious than the boosting ones.

  • Content hiding: These are techniques where terms and links are hidden when the browser renders a page. The more common approaches are using color schemes that render the elements in question effectively invisible.
  • Cloaking: We all know this one, right? This is when one identifies a search engine crawler and shows the spider a different version of the page than the average user would see. This, one assumes, cuts down on the chances of being reported by users or competitors that might otherwise see the spammy page.
  • Redirection: The browser is automatically redirected as soon as the page loads, so the page gets indexed by the engine but the user never actually sees it. The page essentially acts as a proxy/doorway to game the engine and misdirect users.

Ways of Detecting Spam

Approaches to Combating Web Spam

Content Spam

Language: In some testing search engineers looked at the actual languages of pages to see what they might find. Of note, French was most commonly found to be a spam fest, with German and English coming in after that. I found that pattern to be interesting.

Domain: I am sure it comes as no surprise that .BIZ domains have been found to have much higher spam rates than any other. This was followed by .US and .COM domains. But the .BIZ were head and shoulders above the others -- stay away from them, ok?

Words per page: Another approach that is often used. What they found was that pages with more text were often the ones containing more spam. This curve did lessen once over 1500 words. From 750-1500 seemed to be the spammers' sweet spot.

Keywords in page TITLE: This is another area they will look at as testing has shown spam pages tend to use far more KWs in the TITLE element than non-spam pages.

Amount of anchor text: Another interesting approach involves looking at the ratio of text to anchor text on a page. This can be on a page or site level. Websites with a high percentage of anchor text (to standard text) are more likely to be spam sites.
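
As a rough sketch of the anchor text idea, here's one way to measure what share of a page's words sit inside links, using only Python's standard library. The class and function names are made up for illustration, and where the cutoff for "too high" sits is something an engine would presumably learn from known-good pages, so none is hard-coded here.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts words inside <a> tags versus all words on the page.

    Simplification: text inside <script> or <style> is counted as
    ordinary page text, which a real system would want to exclude.
    """

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # > 0 while we are inside an <a> element
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        count = len(data.split())
        self.total_words += count
        if self.anchor_depth > 0:
            self.anchor_words += count


def anchor_text_fraction(html):
    """Return the fraction of words on the page that are anchor text."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words
```

Run it over every page of a site and average the results to get the site-level version of the same signal.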

Fraction of visible content: This one pertains to attempts at using hidden text, not to be confused with code to text ratios. They are looking at the percentage of text that is not actually being rendered on the page.

Compressibility: As a mechanism to fight KW stuffing (or, more specifically, repetitious or spun content), search engines can also look at compression ratios. Search engines often compress pages to save on indexing and processing costs. The compression ratio (uncompressed size divided by compressed size) tends to be unusually high for likely spam pages, because repeated text compresses so well.
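
To make the compression ratio concrete, here's a minimal sketch using Python's built-in zlib module. The function name is my own, and zlib is just a convenient stand-in; I'm not claiming it's the compressor any search engine actually uses.

```python
import zlib


def compression_ratio(text):
    """Uncompressed size divided by compressed size, in bytes."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Illustrative inputs only: a keyword-stuffed string and ordinary prose.
stuffed = "cheap widgets buy cheap widgets " * 200
normal = (
    "Search engines look at many signals when they evaluate a page. "
    "Some of them relate to links, others to the words on the page, "
    "and a few depend on how visitors behave after they click through."
)
print(compression_ratio(stuffed) > compression_ratio(normal))
```

The stuffed string compresses far better than the normal prose, so its ratio is much higher. Where to draw the line between "normal" and "suspicious" is a tuning decision this sketch leaves open.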

Globally popular words: Another good way to find KW stuffing is to compare the words on the page to existing query data and known documents. Essentially, if someone is KW stuffing around given terms, those terms will show up in a more unnatural pattern than they do in user queries and known good pages.

Query spam: Given the rise of query analysis, click data and personalization, spammers might seek to query various target terms and click on their own results. By looking at the pattern of the queries, in combination with other signals, these tactics would become statistically apparent.

Host-level spam is looking at other sites and domains on the server and/or registrar level. Much like trust rank, many times spammers will be found in the same neighborhoods with other spammers.

Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.

Link Spam

TrustRank: This method has more than a few names, TrustRank being the Yahoo flavor. The concept revolves around having "good neighbors." Research shows that good sites link to good ones and vice versa. You are known by the company you keep.
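
The "good neighbors" idea can be sketched as trust flowing outward from a hand-picked set of known-good pages. What follows is a simplified, PageRank-style iteration written for illustration; the function name, the damping value and the toy link graph are all assumptions of mine, not any engine's actual implementation.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outbound links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of pages assumed to be trustworthy (hand-reviewed).
    Pages with no outbound links simply stop passing trust along,
    a simplification a real system would handle more carefully.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)
    for _ in range(iterations):
        # Each round, part of the trust is re-injected at the seeds...
        updated = {p: (1.0 - damping) * seed_score[p] for p in pages}
        # ...and the rest is split evenly across each page's outbound links.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * trust[page] / len(targets)
            for target in targets:
                updated[target] += share
        trust = updated
    return trust


# Toy graph: two spam pages link only to each other.
links = {
    "good_a": ["good_b"],
    "good_b": ["good_a", "good_c"],
    "good_c": [],
    "spam_a": ["spam_b"],
    "spam_b": ["spam_a"],
}
scores = trust_rank(links, seeds={"good_a"})
```

In this toy graph the spam pages end up with no trust at all, because nothing reachable from the seed ever links to them. That's the "company you keep" effect in miniature.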

Link stuffing: This would be more of an on-site approach where a spammer would create a ton of low-value pages and point all the links (even on-site) to the target page. Spam sites tend to have a higher ratio of these types of un-natural appearances (to a training set of known good pages).

Nepotistic links: Here we would have everything from paid links to traded ones (reciprocal). While this may be a hazy area for SEOs, search engines most certainly believe link manipulation in any reciprocal form to be overt manipulation.
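
For the reciprocal flavor specifically, one simple measurement is the share of a site's outbound links that are linked straight back. This is only a sketch with a hypothetical function name and a made-up link graph. Plenty of legitimate sites link to each other, so a number like this would only ever be one input among many.

```python
def reciprocal_fraction(links, site):
    """Fraction of a site's outbound link targets that link back to it.

    links: dict mapping each site to the sites it links to.
    """
    outbound = set(links.get(site, ())) - {site}
    if not outbound:
        return 0.0
    reciprocated = {t for t in outbound if site in links.get(t, ())}
    return len(reciprocated) / len(outbound)


# Made-up example: "a" links to four sites, two of which link back.
example_links = {"a": ["b", "c", "d", "e"], "b": ["a"], "c": ["a"], "d": []}
```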

Topological spamming (link farms): While we have our own vernacular on this one, search engines will look at the percentage of links in the graph compared to known "good" sites. Typically those looking to manipulate the engines will have a higher percentage of links from these locales.

Temporal anomalies: Another area where spam sites generally stand out from other pages in the corpus is the historical data. There will be an average rate of link acquisition and decay among "normal" sites in the index. Temporal data can be used to help detect spammy sites participating in un-natural link building habits.
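
One way to picture this is to compare a site's monthly link gains against a baseline taken from "normal" sites and flag the months that sit far above it. The sketch below assumes a simple z-score test with a threshold of three standard deviations. The function name, the threshold and all the numbers are illustrative, not taken from any paper or patent.

```python
from statistics import mean, stdev


def link_spikes(monthly_new_links, baseline, threshold=3.0):
    """Return the indexes of months whose link gains look abnormal.

    monthly_new_links: new inbound links per month for the site being checked.
    baseline: monthly link gains observed on known "normal" sites.
    threshold: how many standard deviations above the baseline mean
               counts as a spike (an assumed value).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return []
    return [
        month
        for month, gained in enumerate(monthly_new_links)
        if (gained - mu) / sigma > threshold
    ]


# Made-up numbers: a steady baseline and a site with one sudden burst.
baseline = [10, 12, 9, 11, 13, 10, 12, 11]
site = [11, 12, 400, 10]
print(link_spikes(site, baseline))
```

Only the third month (index 2) is flagged here. A fuller version would also watch how quickly those links disappear again, since the decay side is part of the signal too.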

Lessons for SEOs

What's the point of it all? To me this trail was interesting on a few levels. Let's have a look:

  • Ranking Signals: If we reverse-engineer their reverse engineering of us, we can start to actually see what signals are important and which they wish to protect. Understanding what they're protecting tells us what they consider important. Right?
  • Signal Funnel: Considering the amount of effort put into link spam, we do know that modern link-centric search engines have an interest in less diversified ranking approaches. That is to say, if you NEED links to rank, they know where to look for the spammers. Dealing with web spam is heavily tied to the future of search. Watch and learn.
  • You are the bad guys: As discussed, we're not on most search engineers' Xmas card lists. Know this and understand it. They tolerate us -- even the most well-meaning "white hat" among us.
  • Dampening is more common: Another thing I learned is that more often than not, especially with borderline link spam, the juice would be turned off, not the site de-indexed. Is that a penalty or a filter? Does it matter?
  • Authority/Trust: We would be wise to watch where we play. Building authority and becoming associated with other known entities is at a premium.

As always, it never hurts to understand search engines better if you're going to be optimizing for them. Heck, maybe if we, as a group, begin to understand search engineers and their challenges better, they might speak well of us some day. Naw, that's just a silly dream.

Combinations Create the Spam Signals

One thing that is always important to remember is that in most cases no one signal nor approach is considered definitive. Search engines often employ a variety of methods to find spam. This, for those of us playing nice, means there is less of a chance of a false positive.

To get your clients or yourself into hot water generally would mean that you would be satisfying more than one element. That being said, most of the folks in the search community aren't big fans of SEO and there are those that feel even the minor "manipulations" should be punishable. From what I know, we need not get too worried about a lynching just yet. Ultimately there are levels and thresholds and as long as you stay clear of tripping too many wires, things should be ok.
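
The "too many wires" idea can be sketched as a simple tally: check each signal against its own threshold and only raise a flag once several trip at the same time. Everything below is hypothetical, including the signal names, the threshold values and the minimum count. Real systems would more likely weight and learn these than count them.

```python
def tripped_signals(signals, thresholds):
    """Names of the signals whose value exceeds its threshold."""
    return sorted(
        name
        for name, value in signals.items()
        if name in thresholds and value > thresholds[name]
    )


def looks_spammy(signals, thresholds, min_tripped=2):
    """Flag a page only when at least `min_tripped` signals trip together."""
    return len(tripped_signals(signals, thresholds)) >= min_tripped


# Hypothetical thresholds and measurements, for illustration only.
thresholds = {
    "compression_ratio": 4.0,
    "anchor_text_fraction": 0.5,
    "title_keywords": 8,
}
one_wire = {"compression_ratio": 5.2, "anchor_text_fraction": 0.1, "title_keywords": 3}
many_wires = {"compression_ratio": 5.2, "anchor_text_fraction": 0.7, "title_keywords": 12}
```

With these made-up numbers, the first page trips a single wire and is left alone, while the second trips all three and gets flagged.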

One thing is for sure, you will be a much better SEO if you get a better grounding in information retrieval. This post touches on some common aspects -- there's a TON more for those that are interested.

I hope you enjoyed the journey ... play safe!

Patents, Papers and Videos

Before I go, here's a ton of research and reading that you can get into if you want to learn more -- my goal is always to motivate peeps into learning more. No single blog post can do any IR (information retrieval) topic justice. The following are some of the items I looked at when putting this together.

Web Spam Research Papers

TrustRank Concepts

Link Spam

Implicit/Explicit signals

Cloaking

Social Spam

Language/Semantic related

Videos

WebSpam: Dr. Marc Najork - Microsoft Research

Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.

More Videos:

Patents

Trust-related signals

Query Spam

Link Spam

Cloaking and redirection spam

Other

Now, if that wasn't everything you ever wanted to know about web spam, then I don't know what is!! :0)

David Harry is an SEO and search analyst with Reliable SEO. He also runs the SEO Training Dojo, a top community in the SEO space. You can also track him down via Twitter: @theGypsy.

Comments

Apr 28, 2010

Wow great job putting this together! It's like what not to do!

Apr 28, 2010

This article rocks! So much to take in, but definitely some excellent snippets of information in there I can take onboard straight away to make sure I don't fall foul of Cutts & Co.

Dave (not verified)
Apr 28, 2010

@Evan - that was the motivation (beyond I just love reading the stuff); to create awareness. The last thing we need is to stumble into a bad situation via ignorance. @Ubisan - and U rock too!! lol... I like to think of it as getting into the mindset of a search engineer. The more one thinks like a search engine, the better decisions they'll make in SEO (or so the theory goes). Understanding how to avoid a spammy profile, is one piece of that puzzle. Thanks for takin' the time to comment! I'm glad it got ya thinking!

Apr 28, 2010

Dave- awesome piece. Good thing you had plenty of text otherwise all those anchor text links would have tripped the spam filter. :-)

Well done David! One thing I've been wondering is how long it will be before voice recognition software is used on the captchas that use audio. It seems like these days there's a fine line between duplicate content on various sites, and aggregated content via rss feeds etc. Matt Cutts of the Google Web Spam team has recently talked about using the canonical tags in feeds to ensure sites don't lose credit for their work. I'd definitely like to see an article where this is actually applied.

[...] The Definitive Guide to Web Spam for SEOs | WordStream Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming. (emphasis mine) (tags: Guide WebSpam SEO) More on the topic: SEO is not a crime. Google does profile SEOs. They’re identified as “high risk” and so are all of their associated projects. Let's help them :-) More on the topic: SEOs are not Criminals ... googleoff googleon: It turns out there is a way to stop Google from indexing, and from showing as a snippet, part of the text on a given page. For this purpose the special tags googleoff and googleon are used, which are recognized by Google Bot. Example: Blah blah and the text up to here is indexed <!– googleoff: %s –> This is not indexed <!– googleon: %s –> This [...]

Dave (not verified)
Apr 30, 2010

Adam & Arnie, thanks for dropping in and thanks for the comments. It was certainly an interesting post to write/research. I really don't like spammy crap and polluting the web. Sure, I know SEOs that do it, but I still am not a fan. It is one helluva undertaking for SEs to combat the crap, and some of the solutions are interesting. As for duplicate content, it is more an issue of ending up with over-indexation and fractured link stability than any form of penalty really (unless your site was entirely made up of scraped content of course).

[...] Read the whole story at WordStream Internet Marketing. [...]

May 03, 2010

Great article and well said.

excellent article, thanks for sharing!

[...] The definitive guide to web spam – David Harry [...]

[...] Over the past couple months a lot of buzz has been generated regarding unethical business practices in the search marketing industry.  We all know the deal – if it’s not spammers claiming the age old claim of guaranteed first page rankings, it’s “independent authorities” selling rankings for profit, lead generation businesses selling badges that let buyers claim they’ve been “rated” as being the best in our industry.  Of course these are just a couple examples of an entire dark side to the business and there are countless more. [...]

wb (not verified)
May 25, 2010

I disagree with this fundamental underlying assumption that something is spam, just because it replicates something else that is on the web. My neighborhood is full of Chinese restaurants. But why shouldn't someone be entitled to open another Chinese restaurant, call it a Chinese restaurant, drop a menu on my doorstep, and attempt to compete for their share of the neighborhood's Chinese food consumption? In all fairness, there's A LOT of garbage on the web, just like there are some lousy Chinese restaurants advertising gourmet fare. But let's not throw out the baby with the bathwater--there's good reason for lots of competition in lucrative markets. Ultimately, the market will allow the best to rise to the top. Meanwhile, the aggressive competition forces them to continuously improve their product and value proposition.

Awesome list, thank you for great patent.

Great aggregation of resources! The way I think of white hat SEO is that I am helping remove barriers to Google fully understanding my content. For instance, instead of silly or clever titles, headlines, etc., I encourage writers to only put what the actual article is about. I think the Twitter culture has helped in this area, because it has created a cultural atmosphere of getting straight to the point in your writing.

[...] Preamble: this article is a translation of the article: Web Spam: The Definitive Guide [...]

[...] a very treacherous path to walk that can easily see you sliding off into the abyss.  Learning what to look for in your own efforts is the first and possibly most important step in cleaning out the spam from [...]

Do you have a source for the statement that the .biz domain contributes more spam than any other domain?   I can find no data supporting this statement.  This page seems to suggest that the .com domain is responsible for most spam:  http://www.surbl.org/tld
