Semantic Analysis for SEO: Going Beyond LDA
Not so long ago the SEO world was quivering with excitement at a shiny new bauble for its lexicon: LDA (Latent Dirichlet Allocation). Yes, you remember that one, don't ya? It all started with the poorly named SEOmoz tool and has continued ever since, ushering in a new era of three-letter-initialism snake oil.
But really, this can still be a good moment for the search geeks of the realm, I promise. While it has left many sheeple with new smoke to waft in front of their mirrors, it has done something else far more valuable; it has SEOs talking a little more about the world of information retrieval (IR).
LDA is but one of a wide variety of semantic analysis approaches. Rather than bothering to sort out which ones Google/Bing might be using (yes, more than one can be used in concert), I thought we might just go over WHAT exactly the process is about.
Understanding Semantic Analysis in Modern Search
Semantic analysis (SA) isn't really about synonyms and plurals (stemming) as many folks in the biz seem to believe. If there is any one misconception we hear the most, it is that.
It is really about concepts and themes. The basic problem with establishing on-page relevance is that computers just don't understand language very well (a 6th-grade level, last I heard). So search engines use SA to try to better understand what a page is about.
I like to use the example of a search for "jaguar." This could be a car, a big cat, an operating system, a football team, etc.
To better understand what the page is about, the engines look at the terms/phrases on the page to categorize it. For the car, we'd find terms/phrases such as "auto mechanic" and "engine"; for the animal, "short hair," "hunts prey" and so on.
Let's look at a search for "White House." This doesn't necessarily mean the official residence of the US president. It might simply be about a "white" "house." So the system would look for things such as "President of the United States," "Barack Obama" and so on ... you get the idea.
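To make the idea concrete, here is a toy sketch of that kind of disambiguation in Python. It is only an illustration, not how any search engine actually does it: the cue phrases are hand-picked (the car and animal cues come from the jaguar example above, while "dealership" and "rainforest" are my own additions), and the `guess_sense` function is a made-up name. A real system would learn its cues from a large corpus.

```python
import re

# Hypothetical, hand-picked cue phrases per sense. A real system would
# learn these from a corpus instead of hard-coding them.
SENSE_CUES = {
    "car": ["auto mechanic", "engine", "dealership"],
    "animal": ["short hair", "hunts prey", "rainforest"],
}


def guess_sense(text, sense_cues=SENSE_CUES):
    """Return (best_sense, counts) from how many cue phrases occur in text.

    Uses naive substring matching, so "engine" would also match
    "engineer"; good enough for a sketch, not for production.
    """
    normalized = re.sub(r"\s+", " ", text.lower())
    counts = {
        sense: sum(1 for cue in cues if cue in normalized)
        for sense, cues in sense_cues.items()
    }
    best = max(counts, key=counts.get)
    # If no cue matched at all, refuse to guess.
    return (best if counts[best] > 0 else None), counts


page = "The jaguar hunts prey at night and has short hair with dark rosettes."
sense, counts = guess_sense(page)
print(sense, counts)
```

The page never says "animal" anywhere; the surrounding phrases are what settle the question, which is the whole point.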
Now let's consider value. For each category and its sub-categories there is an occurrence rate: when a certain phrase/term/concept is used, how often is it also used in other documents, and what are the associated correlation rates?
This same process can be used to identify other common terms that the search engine would "expect" to see. This is much different than a simple keyword density approach, as there would be not only a phrase density expectation but a related phrase occurrence factor as well.
This is built from a seed set of documents and adapted over time (machine learning) via click and query analysis data. Repeating the same term over and over at a high density, with none of the expected related phrases, would be a futile effort.
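Here is a rough Python sketch of the difference between plain keyword density and a related-phrase check. Treat it as an illustration under my own assumptions: the function names are made up, the related-phrase list is hand-written ("oval office" is my addition to the White House example), and a real engine would derive its expectations from document sets and usage data rather than a three-item list.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def keyword_density(text, term):
    """Share of the page's words taken up by occurrences of a phrase."""
    words = tokenize(text)
    term_words = tokenize(term)
    n = len(term_words)
    if not words or n == 0:
        return 0.0
    hits = sum(
        1 for i in range(len(words) - n + 1) if words[i:i + n] == term_words
    )
    return hits * n / len(words)


def related_coverage(text, related_phrases):
    """Return (fraction of expected phrases found, list of phrases found)."""
    joined = " " + " ".join(tokenize(text)) + " "
    found = [
        p for p in related_phrases
        if " " + " ".join(tokenize(p)) + " " in joined
    ]
    share = len(found) / len(related_phrases) if related_phrases else 0.0
    return share, found


# Hypothetical list of phrases an engine might "expect" for this topic.
related = ["president of the united states", "barack obama", "oval office"]

stuffed = "white house white house white house cheap white house deals"
natural = "The White House is where the President of the United States lives and works."

for label, page in (("stuffed", stuffed), ("natural", natural)):
    density = keyword_density(page, "white house")
    share, found = related_coverage(page, related)
    print(label, round(density, 2), round(share, 2), found)
```

The stuffed page wins on density and finds none of the related phrases; the natural page has a modest density but at least one expected phrase. Under a phrase-based approach it is the second number that the stuffed page can't fake by repetition.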
Using Semantic Analysis Concepts in SEO
Now we need to consider where this plays into the world of SEO. And one might imagine this is just about the on-page contextual stuff, but this would be incorrect.
One of the other reasons I am more inclined towards phrase-based approaches over LDA (although both may be employed) is that they do far more than mere content analysis. In the various patent filings, the system is used for a variety of elements, including:
On-page content analysis: This is a no-brainer. It will look at the content (and TITLE) to establish concepts.
Links: This part is interesting in that they also look at (A) the anchor text of the inbound link, (B) the semantic relevance of the page the link is on, (C) the TITLE of the page the link is on, and (D) the relevance of the inbound link to the page the link is on.
Duplicate content: There are also patents on using this SA approach to do duplicate detection.
Spam: And of course, they can use it in a variety of ways for detecting spam (in concert with other methods).
Personalization: Also of interest, they talk about personalization of such a system, meaning a model would be built of "good phrases" for various user types. Also they mention "Personalized Topic-Based Document Descriptions" that uses the system to adapt snippets.
This all makes the phrase-based IR approach very flexible and it offers a lot of value. This alone, or in concert with other SA methods, seems to be a powerful tool with far-reaching implications beyond what most SEOs seem to conceptualize.
For our purposes, we're more interested in the on-page content and of course the inbound links. I'd also venture out far enough to even consider page naming conventions and outbound links, but I've never actually seen that in any of the documentation.
- Get a list of supporting concepts for usage in your content
- Craft TITLE with related terms
- Evaluate relevance of pages where you are hoping to get links from
- Add strong relevance via outbound links
- Use relevant page naming conventions
- Do the same for any content drops
Finding the phrases/concepts would be part of the keyword research aspects of the program. We'd not only look at primary terms, secondary terms and modifiers, but also collect a list of "related phrases."
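With a related-phrase list in hand, the "evaluate relevance of pages where you are hoping to get links from" step could be roughed out along these lines. This is just a sketch in Python: the three fields mirror the link signals mentioned earlier (anchor, TITLE, page content), but the weights, the function names and the example.com prospects are all invented for illustration and are not taken from any patent or search engine.

```python
def phrase_hits(text, phrases):
    """Count how many of the phrases occur in the text (naive substring match)."""
    lowered = " ".join(text.lower().split())
    return sum(1 for p in phrases if p.lower() in lowered)


# Invented weights: treat anchor text as the strongest hint, body as the weakest.
WEIGHTS = {"anchor": 3.0, "title": 2.0, "body": 1.0}


def score_link_prospect(prospect, phrases, weights=WEIGHTS):
    """Weighted count of related phrases across a prospect's anchor, title and body."""
    return sum(
        weight * phrase_hits(prospect.get(field, ""), phrases)
        for field, weight in weights.items()
    )


# Related phrases from keyword research (here, the "jaguar the car" examples).
phrases = ["jaguar", "auto mechanic", "engine"]

# Hypothetical link prospects; the URLs and text are made up.
prospects = [
    {
        "url": "https://example.com/recipes",
        "anchor": "click here",
        "title": "Weeknight dinners",
        "body": "Quick pasta recipes.",
    },
    {
        "url": "https://example.com/garage",
        "anchor": "jaguar engine repair",
        "title": "Auto mechanic tips",
        "body": "Our auto mechanic explains engine care.",
    },
]

ranked = sorted(
    prospects, key=lambda p: score_link_prospect(p, phrases), reverse=True
)
for p in ranked:
    print(p["url"], score_link_prospect(p, phrases))
```

The absolute numbers mean nothing; the ordering is what helps the link building crew decide where to spend their time first.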
Tools of the Trade
This is where the real problem exists. There are no tools for this really. The badly named SEOmoz tool is a start, but it doesn't really do what we want it to. What we need would be a tool that:
- Analyzes top 10-20 ranked pages for most common phrases/terms/concepts
- Spits back a list of said terms
- Analyzes your target page, compares it against list and gives back suggestions
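As a rough illustration of the three steps above, here is what the core of such a tool might look like in Python. It is a sketch, not a product: it assumes you have already fetched the text of the top-ranked pages yourself (no SERP scraping here), it treats one- and two-word n-grams as "phrases," and the stopword list, the `min_share` threshold and all function names are my own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real tool would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "it"}


def ngrams(text, n_max=2):
    """Set of 1..n_max word phrases in the text.

    Stopwords are dropped before phrases are built, so a two-word phrase
    may span a removed stopword. That is a deliberate simplification.
    """
    words = [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]
    grams = set()
    for n in range(1, n_max + 1):
        for i in range(len(words) - n + 1):
            grams.add(" ".join(words[i:i + n]))
    return grams


def common_phrases(top_pages, min_share=0.5, n_max=2):
    """Phrases appearing on at least min_share of the top pages, with page counts."""
    doc_freq = Counter()
    for page in top_pages:
        doc_freq.update(ngrams(page, n_max))  # each phrase counted once per page
    cutoff = min_share * len(top_pages)
    return {phrase: count for phrase, count in doc_freq.items() if count >= cutoff}


def suggest_missing(target_page, top_pages, min_share=0.5, n_max=2):
    """Common phrases the target page lacks, most widely shared first."""
    expected = common_phrases(top_pages, min_share, n_max)
    present = ngrams(target_page, n_max)
    missing = [p for p in expected if p not in present]
    return sorted(missing, key=lambda p: (-expected[p], p))


# Made-up stand-ins for the text of top-ranked pages.
top_pages = [
    "Jaguar engine repair by a certified auto mechanic",
    "Find an auto mechanic for your Jaguar engine",
    "Jaguar dealership and engine service",
]
target = "Our Jaguar page talks about the jaguar a lot"

print(suggest_missing(target, top_pages))
```

Counting how many pages use a phrase, rather than how many times it appears in total, keeps one keyword-stuffed competitor from skewing the list.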
Sadly, we've nothing like that. I have been talking to the WordStream gang about developing something like this; time will tell. For now, there is a tool I have used to get close. It is the also poorly named LSI Tool from the folks over at NicheBot.
They have a tool that will search and categorize terms via semantic baskets. Now, let me just say, this isn't the perfect tool. And, as described above, I'd personally build a much different one that uses this type of approach but also does some SERP analysis. But it's a start. Maybe our good friends at WordStream here would like to work on it?
So, to give you an idea, here's the main screen for a given project:
[Screenshot: LSI Tool main project screen]
And here is the "LSI View":
[Screenshot: LSI Tool "LSI View"]
From there you would start culling the lists depending on the goals of the page/site/project in question. After you have the optimal list, export the whole thing. Once you have that, you can use it in content creation, to give to writers, the link building crew (to add semantic diversity to inbound anchor texts), the folks dealing with on-site such as TITLE and page naming conventions ... the whole bloody team.
Of course, there is an art to this. Since we don't have a tool to properly analyze top query spaces, you must do some research on your own and rely on "instincts" as well. Other tools of interest are listed at the bottom of this post (none really do the trick).
The Value of Understanding Semantic Analysis for SEO
And so, the question ultimately arises: What is the value of all this for the average SEO Joe/Jill? What is the point of knowing all of this?
I always put it this way: Link building is by far the most difficult and often most expensive aspect of SEO. As such, we want to nail as many on-site factors as possible, and semantic analysis is one of the more important areas for this. As search engines get more powerful, they will have more options for doing a better job of it (less noise), which will likely boost the weight these signals carry.
As we've seen, semantic analysis can be understood/employed throughout the SEO process, from site development to the metadata to contextual on-page work and even link building. To get the most bang for yer buck, this type of approach does pay off in the long run.
Does it work? I sure like to think so. I have, for years, approached my SEO from the standpoint of some type of semantic analysis being in play. Knowing which type isn't nearly as important as knowing how to incorporate it into the SEO process.
By giving some thought to how to work within semantic best practices you can certainly increase the value now and (IMO) into the future.
Here's some more reading:
- Phrase-Based IR on Reliable SEO
- Phrase-Based IR resources
- Another post on phrases
- Understanding Semantic Search on SEJ
- Some stuff from Bill on semantic analysis
- Understanding LDA
- Bing does phrases too, on SEJ
If you have questions ... do sound off in the comments.
David Harry is an SEO and search analyst with Reliable SEO. You can also track him down via Twitter: @theGypsy.