Web Spam: The Definitive Guide
Understanding the Boundaries and How to Play Safe
Are you a web spammer? No, seriously, I mean it. If there is one area that a lot of search peeps and marketers aren't always clear on, it's penalties and filters from search engines. This is something you will find very common in SEO circles. We need to look no further than something like duplicate content. While it is (generally) a filter, there is no shortage of people that call it a "duplicate content penalty."
As such, I thought it would be a good idea to look at the many faces of web spam, from the search engineer perspective. This isn't about teaching you how to be a better spammer -- quite the opposite actually, as I am not a fan of that crap. Sure, I have a few mates that play in the black hat world, but they are well aware I am not a fan of it, or polluting the web in general.
This journey is hopefully about helping you avoid tactics, or groups of activities, that might put your clients' or your own websites at risk.
Defining Web Spam
What is web spam? In the research for this post this seemed to be the best, or at least most concise, definition I came across:
any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value. (from Web Spam Taxonomy, Stanford)
Hmmmm. Or is it? If this were the case we'd ALL be spammers, since what we do as SEOs is attempt to stack the deck somewhat. Dammit. Oh well. Of further interest, that Stanford paper goes on to say,
An important voice in the web spam area is that of search engine optimizers (SEOs), such as SEO Inc. (www.seoinc.com) or Bruce Clay (www.bruceclay.com).
Ouch. Not nice at all -- how about:
Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming. (emphasis mine)
Holy shizzle -- it reminds us that SEOs aren't criminals, but they are certainly the enemy. Let us diverge somewhat and consider that spamming is the blatant manipulation that adds no value and seeks to leverage the algorithmic blindness of a search algorithm, ok? Lol -- leave it at that. And never forget, they don't like us (SEOs).
Types of Web Spam
There are essentially two types of spamming: boosting and hiding.
Boosting is when one takes an action intended to (falsely?) increase or boost the value of a page.
- Term Spamming: This would be those seeking to manipulate through elements such as the page TITLE (title spam), Meta Description or Meta Keywords (meta spam). As most of us know, two out of three of those were abused to the point where most modern search engines don't use them as signals at all.
- URL Spamming is another area they've been known to also look at. Yup, strange as it sounds, because there is some weight given to URLs by some search engines, it can be considered to be a manipulation.
- Link Spamming is another well-known one that also includes anchor text spamming. Search engines consider not only the mass of link spam, but also the anchor text, as this is one of the more important signals from a ranking perspective. This section obviously also includes when spammers seek to drop links on pages to increase a target page's value (forums, comments, guest books, etc.) and obviously the more nefarious hack and drop techniques.
Hiding is the set of techniques where one uses methods that are not generally noticeable to get a page to rank higher. Or more appropriately, the hiding of boosting techniques. These are certainly more problematic, and search engines tend to treat them as more insidious than the boosting ones.
- Content hiding: These are techniques where terms and links are hidden when the browser renders a page. The more common approaches are using color schemes that render the elements in question effectively invisible.
- Cloaking: We all know this one, right? This is when one identifies a search engine crawler and seeks to show a different version of the page to the spider than it would for the average user. This, one assumes, cuts down on the chances of being reported by users or competitors that might otherwise see the spammy page.
- Redirection: The page is automatically redirected by the browser in the same manner so that the page gets indexed by the engine, but the user will never actually see it. Essentially acting as a proxy/doorway to game the engine, and misdirect the users.
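To make the cloaking idea concrete, here is a rough sketch of how one might compare the copy served to a crawler with the copy served to a regular browser. It assumes you already have both versions as plain text; the function names and the 0.5 threshold are my own illustrations, not anything a search engine has published.

```python
import re


def term_set(text: str) -> set[str]:
    """Lowercased word tokens found in a piece of page text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard_similarity(a: str, b: str) -> float:
    """Share of vocabulary the two versions have in common (1.0 = identical)."""
    terms_a, terms_b = term_set(a), term_set(b)
    if not terms_a and not terms_b:
        return 1.0
    return len(terms_a & terms_b) / len(terms_a | terms_b)


def looks_cloaked(crawler_copy: str, browser_copy: str, threshold: float = 0.5) -> bool:
    """Flag the page when the two versions share too little vocabulary.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return jaccard_similarity(crawler_copy, browser_copy) < threshold
```

A real check would also have to allow for legitimate differences such as rotating ads or personalized blocks, which is one reason a single signal like this is never enough on its own.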
Approaches to Combating Web Spam
Language: In some testing search engineers looked at the actual languages of pages to see what they might find. Of note, French was most commonly found to be a spam fest, with German and English coming in after that. I found that pattern to be interesting.
Domain: I am sure it comes as no surprise that .BIZ domains have been found to have much higher spam rates than any other. This was followed by .US and .COM domains. But the .BIZ were head and shoulders above the others -- stay away from them, ok?
Words per page: Another approach that is often used. What they found was that the pages with more text were often the ones containing more spam. This curve did lessen once over 1500 words. From 750-1500 seemed to be the spammers' sweet spot.
Keywords in page TITLE: This is another area they will look at as testing has shown spam pages tend to use far more KWs in the TITLE element than non-spam pages.
Amount of anchor text: Another interesting approach involves looking at the ratio of text to anchor text on a page. This can be on a page or site level. Websites with a high percentage of anchor text (to standard text) are more likely to be spam sites.
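As a rough illustration of the anchor text ratio, the sketch below counts how many of a page's words sit inside links. It uses only Python's standard `html.parser`; the class and function names are mine, and it makes no attempt to skip script or style content, so treat it as a toy rather than how any engine actually measures this.

```python
from html.parser import HTMLParser


class AnchorTextCounter(HTMLParser):
    """Counts words inside <a> elements versus all words on the page."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0   # >0 while we are inside an <a> element
        self.anchor_words = 0
        self.total_words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        words = len(data.split())
        self.total_words += words
        if self.anchor_depth > 0:
            self.anchor_words += words


def anchor_text_fraction(html: str) -> float:
    """Fraction of the page's words that are anchor text (0.0 for an empty page)."""
    parser = AnchorTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_words == 0:
        return 0.0
    return parser.anchor_words / parser.total_words
```

The same count could be summed across every page of a site to get the site-level version mentioned above.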
Fraction of visible content: This one pertains to attempts at using hidden text, not to be confused with code to text ratios. They are looking at the percentage of text that is not actually being rendered on the page.
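Here is a simplified sketch of measuring how much of a page's text is never rendered. It assumes the hiding is done with inline `display:none` or `visibility:hidden` styles; real pages also hide text with external stylesheets, matching colors and scripts, none of which this catches. All names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
HIDDEN_STYLE = re.compile(r"display\s*:\s*none|visibility\s*:\s*hidden", re.I)


class HiddenTextCounter(HTMLParser):
    """Counts text characters inside elements hidden by an inline style."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0   # >0 while inside a hidden element
        self.hidden_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        # Once inside a hidden element, every nested element is hidden too.
        if self.hidden_depth > 0 or HIDDEN_STYLE.search(style):
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.hidden_depth > 0:
            self.hidden_depth -= 1

    def handle_data(self, data):
        chars = len(data.strip())
        self.total_chars += chars
        if self.hidden_depth > 0:
            self.hidden_chars += chars


def hidden_text_fraction(html: str) -> float:
    """Fraction of text characters that sit inside inline-hidden elements."""
    parser = HiddenTextCounter()
    parser.feed(html)
    parser.close()
    if parser.total_chars == 0:
        return 0.0
    return parser.hidden_chars / parser.total_chars
```

The depth tracking assumes well-formed markup where every non-void tag is closed, which is a generous assumption for the open web.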
Compressibility: As a mechanism used to fight KW stuffing, or more specifically repetitious or spun content, search engines can also look at compression ratios. Search engines often compress a page to save on indexing and processing. Because repeated text compresses very well, likely spam pages tend to have a high compression ratio (uncompressed size divided by compressed size).
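The compression ratio is easy to try for yourself. This small sketch uses Python's standard `zlib` module as a stand-in compressor (I'm assuming nothing about which compressor an engine would actually use) and computes the uncompressed size divided by the compressed size.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by compressed size; higher means more repetitive."""
    raw = text.encode("utf-8")
    if not raw:
        return 1.0  # nothing to compress, treat as neutral
    return len(raw) / len(zlib.compress(raw))


# A keyword-stuffed page repeats itself, so it compresses far better
# than a paragraph of ordinary writing.
stuffed = "cheap pills buy cheap pills online " * 200
natural = ("We bake sourdough every morning using flour from a mill "
           "two villages over, and sell out most days before noon.")
```

Comparing `compression_ratio(stuffed)` with `compression_ratio(natural)` shows the gap. Where to draw the line between the two is a threshold you would have to tune against known good and bad pages.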
Globally popular words: Another good way to find KW stuffing is to compare the words on the page to existing query data and known documents. Essentially if someone is KW stuffing around given terms, they will be in a more unnatural usage than user queries and known good pages.
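One loose way to picture this comparison: take each term's share of the page and set it against how often the term shows up in some reference data. In the sketch below the reference shares, the multiplier and the minimum count are all made-up values for illustration; a real system would derive them from query logs and known good documents.

```python
import re
from collections import Counter


def overused_terms(page_text: str,
                   reference_share: dict[str, float],
                   factor: float = 10.0,
                   min_count: int = 3) -> list[str]:
    """Terms whose share of the page exceeds `factor` times their reference share.

    `reference_share` maps a term to the fraction of all words it represents
    in the reference data. `factor` and `min_count` are hypothetical knobs.
    """
    words = re.findall(r"[a-z']+", page_text.lower())
    total = len(words)
    flagged = []
    for term, count in Counter(words).items():
        expected = reference_share.get(term)
        if expected is None or count < min_count:
            continue  # unknown or rarely used terms are not judged here
        if count / total > factor * expected:
            flagged.append(term)
    return sorted(flagged)
```

For example, with a reference share of 0.0005 for "pills", a nine-word page that uses the word four times gets flagged, while a single "the" does not.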
Query spam: Given the rise of query analysis, click data and personalization, spammers might seek to query various target terms and click on their own results. By looking at the pattern of the queries, in combination with other signals, these tactics would become statistically apparent.
Host-level spam: This looks at other sites and domains at the server and/or registrar level. Much like TrustRank, many times spammers will be found in the same neighborhoods as other spammers.
Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.
TrustRank: This method has more than a few names, TrustRank being the Yahoo flavor. The concept revolves around having "good neighbors." Research shows that good sites link to good ones and vice versa. You are known by the company you keep.
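To get a feel for the "good neighbors" idea, here is a heavily simplified sketch: trust starts at a hand-picked set of good pages and flows along their outbound links, much like a PageRank that only restarts at the seeds. This is my own toy version under those assumptions, not the published algorithm, and the damping value and iteration count are illustrative.

```python
def trust_scores(links: dict[str, list[str]],
                 seeds: set[str],
                 damping: float = 0.85,
                 iterations: int = 50) -> dict[str, float]:
    """Propagate trust from seed pages along outbound links.

    `links` maps each page to the pages it links to. Pages that no trusted
    page can reach end up with a score of 0.0.
    """
    if not seeds:
        raise ValueError("at least one trusted seed page is required")
    pages = set(links) | seeds | {t for targets in links.values() for t in targets}
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        # Each round, seeds get a fresh allotment of trust...
        updated = {p: (1.0 - damping) * seed_share[p] for p in pages}
        # ...and every page passes a damped share of its trust to its targets.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * trust[page] / len(targets)
            for target in targets:
                updated[target] += share
        trust = updated
    return trust
```

Notice that a cluster of spam pages linking only to each other never picks up any trust, no matter how many links they trade among themselves.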
Link stuffing: This would be more of an on-site approach where a spammer would create a ton of low-value pages and point all the links (even on-site) to the target page. Spam sites tend to have a higher ratio of these types of un-natural appearances (to a training set of known good pages).
Nepotistic links: Here we would have everything from paid links to traded ones (reciprocal). While this may be a hazy area for SEOs, search engines most certainly believe link manipulation in any reciprocal form to be overt manipulation.
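The reciprocal part of this is simple to measure if you have a link graph at hand. The sketch below, with hypothetical names and example hosts, works out what fraction of a site's outbound links are returned by the sites it points to. It says nothing about paid links, which need other signals.

```python
def reciprocal_link_fraction(links: dict[str, set[str]], site: str) -> float:
    """Fraction of `site`'s outbound link targets that link back to it.

    `links` maps each site to the set of sites it links to.
    """
    outbound = links.get(site, set())
    if not outbound:
        return 0.0
    reciprocated = {target for target in outbound
                    if site in links.get(target, set())}
    return len(reciprocated) / len(outbound)


example_links = {
    "a.example": {"b.example", "c.example", "d.example", "e.example"},
    "b.example": {"a.example"},
    "c.example": {"a.example"},
    "d.example": set(),
}
```

Here `reciprocal_link_fraction(example_links, "a.example")` comes out at half, since two of its four targets link back. What counts as "excessive" would again be a matter of comparing against known good sites.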
Topological spamming (link farms): While we have our own vernacular on this one, search engines will look at the percentage of links in the graph compared to known "good" sites. Typically those looking to manipulate the engines will have a higher percentage of links from these locales.
Temporal anomalies: Another area where spam sites generally stand out from other pages in the corpus is the historical data. "Normal" sites in the index show a typical average rate of link acquisition and decay. Temporal data can be used to help detect spammy sites whose un-natural link building habits deviate from that average.
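As a back-of-the-napkin version of this, the sketch below compares a site's new links per month against the mean and spread seen on known "normal" sites, and flags months that sit far above it. The baseline numbers and the cutoff of 3 standard deviations are assumptions I picked for the example.

```python
from statistics import mean, pstdev


def spike_months(site_new_links: list[int],
                 baseline_new_links: list[int],
                 threshold: float = 3.0) -> list[int]:
    """Indexes of months where link growth is far above the baseline.

    `baseline_new_links` holds monthly new-link counts from known good sites.
    `threshold` is how many baseline standard deviations above the baseline
    mean a month must be before it is flagged (a hypothetical cutoff).
    """
    mu = mean(baseline_new_links)
    sigma = pstdev(baseline_new_links)
    if sigma == 0:
        # A flat baseline gives no spread to measure against.
        return [i for i, n in enumerate(site_new_links) if n > mu]
    return [i for i, n in enumerate(site_new_links)
            if (n - mu) / sigma > threshold]
```

With a baseline of roughly ten new links a month, a month that suddenly brings in 250 stands out immediately, while ordinary wobble around the average does not. Link decay could be handled the same way using counts of lost links.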
Lessons for SEOs
What's the point of it all? To me this trail was interesting on a few levels. Let's have a look:
- Ranking Signals: If we reverse-engineer their reverse engineering of us, we can start to actually see what signals are important and which they wish to protect. Understanding what they're protecting tells us what they consider important. Right?
- Signal Funnel: Considering the amount of effort put into link spam, we do know that modern link-centric search engines have an interest in less diversified ranking approaches. That is to say, if you NEED links to rank, they know where to look for the spammers. Dealing with web spam is heavily tied to the future of search. Watch and learn.
- You are the bad guys: As discussed, we're not on most search engineers' Xmas card lists. Know this and understand it. They tolerate us -- even the most well-meaning "white hat" among us.
- Dampening is more common: Another thing I learned is that more often than not, especially with borderline link spam, the juice would be turned off, not the site de-indexed. Is that a penalty or a filter? Does it matter?
- Authority/Trust: We would be wise to watch where we play. Building authority and becoming associated with other known entities is at a premium.
As always, it never hurts to understand search engines better if you're going to be optimizing for them. Heck, maybe if we, as a group, begin to understand search engineers and their challenges better, they might speak well of us some day. Naw, that's just a silly dream.
Combinations Create the Spam Signals
One thing that is always important to remember is that in most cases no one signal or approach is considered definitive. Search engines often employ a variety of methods to find spam. This, for those of us playing nice, means there is less of a chance of a false positive.
To get your clients or yourself into hot water generally would mean that you would be satisfying more than one element. That being said, most of the folks in the search community aren't big fans of SEO and there are those that feel even the minor "manipulations" should be punishable. From what I know, we need not get too worried about a lynching just yet. Ultimately there are levels and thresholds and as long as you stay clear of tripping too many wires, things should be ok.
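To picture the "levels and thresholds" idea, here is a tiny sketch that only raises a flag when more than one signal crosses its limit. The signal names, the limits and the rule of "two or more" are invented for the example; nobody outside the engines knows the real values or how they are weighted.

```python
def tripped_signals(signals: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Names of the signals whose value exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if signals.get(name, 0.0) > limit)


def needs_review(signals: dict[str, float],
                 thresholds: dict[str, float],
                 min_tripped: int = 2) -> bool:
    """Flag a page only when several independent signals agree."""
    return len(tripped_signals(signals, thresholds)) >= min_tripped


# Hypothetical limits, for illustration only.
example_thresholds = {
    "compression_ratio": 4.0,
    "anchor_text_fraction": 0.5,
    "hidden_text_fraction": 0.2,
}
```

A page that is merely repetitive trips one wire and is left alone, while a page that is repetitive, link-heavy and full of hidden text trips three. That is the practical meaning of staying clear of "tripping too many wires."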
One thing is for sure, you will be a much better SEO if you get a better grounding in information retrieval. This post touches on some common aspects -- there's a TON more for those that are interested.
I hope you enjoyed the journey ... play safe!
Patents, Papers and Videos
Before I go, here's a ton of research and reading that you can get into if you want to learn more -- my goal is always to motivate peeps into learning more. No single blog post can do any IR (information retrieval) topic justice. The following are some of the items I looked at when putting this together.
Web Spam Research Papers
- Spam Double-Funnel: Connecting Web Spammers with Advertisers - the Search Ranger system
- Detecting Spam Web Pages through Content Analysis - Microsoft
- Improving web spam classification using rank-time features - (AIRWeb 2007)
- Adversarial Information Retrieval on the Web - (AIRWeb 2007)
- Web Spam Detection Using Decision Trees - Indian Institute of Information Technology
- Web Spam Detection: link-based and content-based techniques - Yahoo
- Web spam Identification Through Content and Hyperlinks - Yahoo
- Combating Web Spam with TrustRank - Stanford 2004
- Propagating Trust and Distrust to Demote Web Spam - Lehigh University
- Recognizing Nepotistic Links on the Web - B.Davison
- Detecting nepotistic links by language model disagreement
- Link Spam Alliances - Stanford
- Know your Neighbors: Web Spam Detection using the Web Topology - Yahoo
- Identifying excessively reciprocal links among web entities - Yahoo (patent)
- Link Based Small Sample Learning for Web Spam Detection - Chinese Academy of Sciences
- Undue influence: eliminating the impact of link plagiarism on web search rankings - B Wu, BD
- Detecting link spam using temporal information - Microsoft
- Extracting link spam using biased random walks from spam seed sets - B Wu, K Chellapilla
- Link Analysis for Web Spam Detection - Yahoo Research
- Link Spam Detection Based on Mass Estimation - Stanford
- Link Based Characterization and Detection of Web Spam - Yahoo
- Identifying Web Spam with User Behaviour Analysis - AIRweb
- User Behavior Oriented Web Spam Detection - WWW
- Web Spam Detection via Commercial Intent Analysis - Andras Benczur, Istvan Biro, Karoly Csalogany
- Query-log mining for detecting spam - Yahoo
- Cloaking and Redirection: A Preliminary Study - Lehigh University
- Detecting Semantic Cloaking on the Web - Lehigh University
- The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems - AIRWeb
- An Empirical Study on Selective Sampling in Active Learning for Splog Detection - AIRweb
- Identifying Video Spammers in Online Social Networks - Polytechnic University
- Social Spam Detection - Indiana University
- Web spam identification through language model analysis - AIRweb
- Exploring Linguistic Features for Web Spam Detection: A Preliminary Study - Various authors
WebSpam: Dr. Marc Najork - Microsoft Research
Topics include search advertising and auctions, search and privacy, search ranking, internationalization, anti-spam efforts, local search, peer-to-peer search, and search of blogs and online communities.
- Using Rank Propagation and Probabilistic Counting for Link-based Spam Detection - Yahoo! Research
- Web Spam Challenge 2007 Track II - Secure Computing Corporation Research
- Web Spam Detection - Sapienza University of Rome
- WITCH: A New Approach to Web Spam Detection - Google Tech Talks
- Yahoo - Identifying Spam hosts using stacked graphical learning
- Yahoo - Detecting spam hosts based on propagating prediction levels
- Detecting web spam from changes to links of websites - Microsoft
- Method for detecting link spam in hyperlinked databases - Google
- Identifying excessively reciprocal links among web entities - Yahoo
- Link-based spam detection - Yahoo
Cloaking and redirection spam
- Cloaking detection utilizing popularity and market value. - Microsoft
- System and method for identifying cloaked web servers - Najork, Marc A.; January 4, 2002 (now with Microsoft)
- Search Ranger System and Double-Funnel Model for Search Spam Analyses and Browser Protection (cloaking) - Microsoft
- Discovering and determining characteristics of network proxies - Yahoo
- Detecting spam documents in a phrase based information retrieval - Google
- Multimedia spam determination using speech conversion - Microsoft
- Domain-based spam-resistant ranking - Microsoft
- Content evaluation - Microsoft
Now, if that wasn't everything you ever wanted to know about web spam, then I don't know what is!! :0)