
Semantic Analysis for SEO: Going Beyond LDA

November 09, 2010
SEO Marketing

Not so long ago the SEO world was quivering with excitement over a shiny new bauble for its lexicon: LDA (Latent Dirichlet Allocation). Yes, you remember that one, don't ya? It all started with the poorly named SEOmoz tool and has continued since, ushering in a new era of three-letter-initialism snake oil.

But really, this can still be a good moment for the search geeks of the realm, I promise. While it has left many sheeple with new smoke to waft in front of their mirrors, it has done something else far more valuable: it has SEOs talking a little more about the world of information retrieval (IR).

A closer look at semantic analysis

LDA is but one of a wide variety of semantic analysis approaches. Rather than bothering to sort out which ones Google/Bing might be using (yes, more than one can be used in concert), I thought we might just go over WHAT exactly the process is about.

Understanding Semantic Analysis in Modern Search

Semantic analysis (SA) isn't really about synonyms and plurals (stemming), as many folks in the biz seem to believe. If there is one misconception we hear more than any other, it is that.

Concepts and themes -- basically, the problem with establishing on-page relevance is that computers just don't understand language very well (at about a 6th-grade level, last I heard). So they use SA to try to better understand what a page is about.

I like to use the example of the search "jaguar." This could be a car, a big cat, an operating system, a football team, etc.

semantic equivalents

To better understand what the page is about, they look for terms/phrases that are on the page to categorize it. In the case of the car, we'd find terms/phrases such as auto mechanic, engine and for the animal, short hair, hunts prey and so on.

Let's look at a search for "White House." This doesn't necessarily mean the president's residence in Washington, D.C. It might simply be about a "white" "house." So the system would look for things such as President of the United States, Barack Obama and so on ... you get the idea.
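To make the idea concrete, here's a toy sketch of disambiguation by co-occurring terms. The category names and indicator-term lists are made up for illustration; a real engine would learn these from a huge corpus, not a hand-written dictionary.

```python
# Toy disambiguation: which sense of "jaguar" does a page most likely cover?
# The indicator-term sets below are invented for illustration only.

CATEGORY_TERMS = {
    "jaguar (car)": {"auto", "mechanic", "engine", "sedan", "dealership"},
    "jaguar (animal)": {"cat", "prey", "hunts", "rainforest", "habitat"},
}

def likely_category(page_text: str) -> str:
    words = set(page_text.lower().split())
    # Score each category by how many of its indicator terms appear on the page.
    scores = {cat: len(terms & words) for cat, terms in CATEGORY_TERMS.items()}
    return max(scores, key=scores.get)

print(likely_category("The jaguar hunts its prey in the rainforest"))
# -> jaguar (animal)
```

The same mechanism covers the "White House" case: a page mentioning "Barack Obama" and "President of the United States" scores into the political bucket rather than the house-paint one.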

Gaining Insight

Now let's consider value: within each category and its sub-categories there is an occurrence rate -- when a certain phrase/term/concept is used, how often does it appear in other documents in that category, and how strongly does it correlate with them?

This same process can be used to identify other common terms the search engine would "expect" to see. It is much different from a simple keyword density approach: there is not only a phrase density expectation but a related phrase occurrence factor as well.
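A crude way to picture that "related phrase occurrence factor" is below. The three-document corpus is made up, and real systems work at vastly larger scale with statistical smoothing, but the shape of the measurement is the same: of the documents containing the target term, what fraction also contain the candidate related phrase?

```python
# Sketch of a related-phrase occurrence rate, as opposed to raw keyword
# density. The tiny corpus here is invented for illustration.

docs = [
    "white house press briefing with the president of the united states",
    "white house statement from barack obama",
    "we painted our white house and fixed the siding",
]

def cooccurrence_rate(target: str, phrase: str, corpus: list[str]) -> float:
    """Fraction of target-containing documents that also contain the phrase."""
    containing = [d for d in corpus if target in d]
    if not containing:
        return 0.0
    return sum(phrase in d for d in containing) / len(containing)

print(cooccurrence_rate("white house", "president", docs))  # 1 of 3 documents
```

A page stuffed with "white house" but scoring near zero on every phrase the corpus says should co-occur with it would look suspicious under this kind of measure.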

This is based on a seed set of documents and adapted over time (machine learning) via click and query analysis data. Repeating the same term over and over at high density, with none of the expected related phrases, would be a futile effort.

Using Semantic Analysis Concepts in SEO

Now we need to consider where this plays into the world of SEO. One might imagine it's just about the on-page contextual stuff, but that would be incorrect.

One of the other reasons I am more inclined towards phrase-based approaches than LDA (although both may be employed) is that they do far more than mere content analysis. In the various patent filings, the system is used for a variety of elements, including:

  • On-page content analysis: This is a no-brainer. It will look at the content (and TITLE) to establish concepts.

  • Links: This part is interesting in that they also look at (a) the anchor text of the inbound link, (b) the semantic relevance of the page the link is on, (c) the TITLE of the page the link is on, and (d) the relevance of the link to the page it sits on.

  • Duplicate content: There are also patents on using this SA approach for duplicate detection.

  • Spam: And of course, they can use it in a variety of ways to detect spam (in concert with other methods).

  • Personalization: Also of interest, they talk about personalizing such a system, meaning a model of "good phrases" would be built for various user types. They also mention "Personalized Topic-Based Document Descriptions," which uses the system to adapt snippets.

This all makes the phrase-based IR approach very flexible, and it offers a lot of value. Alone, or in concert with other SA methods, it seems to be a powerful tool with far-reaching implications beyond what most SEOs seem to conceptualize.

For our purposes, we're more interested in the on-page content and, of course, the inbound links. I'd even venture far enough to consider page naming conventions and outbound links, though I've never actually seen those in any of the documentation.

  1. Get a list of supporting concepts for use in your content
  2. Craft TITLE with related terms
  3. Evaluate relevance of pages where you are hoping to get links from
  4. Add strong relevance via outbound links
  5. Use relevant page naming conventions
  6. Do the same for any content drops

Finding the phrases/concepts would be part of the keyword research aspects of the program. We'd not only look at primary terms, secondary terms and modifiers, but also collect a list of "related phrases."

Tools of the Trade

This is where the real problem exists. There are really no tools for this. The badly named SEOmoz tool is a start, but it doesn't really do what we want. What we need is a tool that:

  • Analyzes top 10-20 ranked pages for most common phrases/terms/concepts
  • Spits back a list of said terms
  • Analyzes your target page, compares it against list and gives back suggestions
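The three bullets above are simple enough to sketch in code. This assumes you've already fetched the text of the top-ranked pages yourself (no real SERP scraping here), and the stopword list is a bare-bones stand-in:

```python
# Minimal sketch of the wished-for tool: find the terms common across
# top-ranked pages, then report which ones the target page is missing.
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "with"}

def common_terms(pages: list[str], top_n: int = 10) -> list[str]:
    counts = Counter()
    for page in pages:
        # Count each term once per page so one term-stuffed page can't dominate.
        counts.update(set(page.lower().split()) - STOPWORDS)
    return [term for term, _ in counts.most_common(top_n)]

def suggestions(target_page: str, ranked_pages: list[str]) -> list[str]:
    expected = common_terms(ranked_pages)
    present = set(target_page.lower().split())
    return [t for t in expected if t not in present]

ranked = [
    "jaguar engine repair by a certified auto mechanic",
    "jaguar auto parts and engine service",
    "classic jaguar engine rebuilds and auto restoration",
]
print(suggestions("my jaguar page about cars", ranked))
```

A production version would obviously want phrases (not just single words), stemming, and a proper stopword list, but the analyze-compare-suggest loop is the whole idea.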

Sadly, we've nothing like that. I have been talking to the WordStream gang about developing something along these lines; time will tell. For now, there is a tool I have used to get close: the (also poorly named) LSI Tool from the folks over at NicheBot.

They have a tool that will search and categorize terms into semantic baskets. Now, let me just say, this isn't the perfect tool. As described above, I'd personally build a much different one that uses this type of approach but also does some SERP analysis. But it's a start. Maybe our good friends at WordStream would like to work on it?

So, to give you an idea, here's the main screen for a given project:


And here is the "LSI View":

From there you would start culling the lists depending on the goals of the page/site/project in question. Once you have the optimal list, export the whole thing. You can then use it in content creation: give it to the writers, the link building crew (to add semantic diversity to inbound anchor texts), the folks dealing with on-site elements such as TITLEs and page naming conventions ... the whole bloody team.

Of course, there is an art to this. Since we don't have a tool to properly analyze top query spaces, you must do some research on your own and rely on "instincts" as well. Other tools of interest are listed at the bottom of this post (none really do the trick).

The Value of Understanding Semantic Analysis for SEO

And so, the question ultimately arises: What is the value of all this for the average SEO Joe/Jill? What is the point of knowing it?

I always put it this way: Link building is by far the most difficult and often most expensive aspect of SEO. As such, we want to nail as many on-site factors as possible, and one of the more important areas for this is semantic analysis. As search engines get more powerful, there will be more options to do a better job (less noise), which will likely boost the weight given to these signals.

As we've seen, semantic analysis can be understood and employed throughout the SEO process, from site development to meta data to on-page context and even link building. To get the most bang for yer buck, this type of approach pays off in the long run.

Does it work? I sure like to think so. For years I have approached my SEO from the standpoint that some type of semantic analysis is in play. Knowing which one isn't nearly as important as knowing how to incorporate it into the SEO process.

By giving some thought to working within semantic best practices, you can certainly increase the value now and (IMO) into the future.

If you have questions ... do sound off in the comments.

David Harry is an SEO and search analyst with Reliable SEO. You can also track him down via Twitter: @theGypsy.


Nov 09, 2010

Good post and interesting little platform from NicheBot. OK, so on the point about the White House: how far back in history can the search engines look to understand that not just Barack Obama was president? What about Bill Clinton? How much does this change or shift in relationships confuse the search algorithms? Even look at Greece, where they have rules about the white houses on the islands; there has to be a line where semantic analysis can stretch and still remain useful and consistent. I'm a fairly strong believer that it's becoming a stronger signal, based on the grouping of keyword sets in Google Webmaster Tools, but not something I would bet the entire campaign on just yet. The only small point is that focusing on a few key links and some quality content does appear to be having a far greater effect than it used to...

Doc Sheldon
Nov 09, 2010

Great piece, David... a really good explanation of how SA works. I suspect we're going to be seeing some impressive advancements coming out in this field in the next year or so. You already know that I agree that RDFa is getting stronger. I think it's going to become a near necessity to remain competitive before too long. And you're spot on with this: "Knowing which isn't nearly as important as knowing how to incorporate it into the SEO process."

Nov 10, 2010

Very interesting post! It may be of interest that I have written a web application which does exactly the required job in semantic analysis and comparison, as well as ranking a cloud of pages and/or hosts. Since there is no public version available (for now), please contact me by email for a test drive.

Dan @ Keyword Research Service
Nov 10, 2010

Hey Doc, How great to bump into you here, as optimistic as always with your "next year or so" expectations :) I don't think that the advancements in the short term (3-4 years) are going to impress us very much, not to mention turn the industry upside down. As for the long run - this is a slowly swelling tidal wave which should eventually affect all we have come to know.

brian mcfarlane
Feb 15, 2011

Hi David, since this is my first contribution to the topic, I wanted to address this in more detail. I only use four tools for SEO: Domain Web Studio, The Last Keyword Tool, Excel and Krakken. Currently there are only 200 users in the world using this system for SEO, which is based on natural language processing, LSI and common sense. Combined, these tools allow me to swallow markets in less time with fewer links, using silo-themed site architecture and a semi-automated site blueprint process that saves me hundreds of hours of work a month. So that's my 2 cents' worth.

Mar 02, 2011

@mck Very interesting piece of kit! @gypsy & everyone else... If you're interested in "LDA" and "semantic word weather" get in touch with @mck! Very very interesting developments going on there!!!! Not seen anything comparable...although is it still up and running today?
