Google Rankings and LDA
Written by David Harry
Thursday, 08 July 2010 16:00
The new LSI for 3rd Generation SEOs?
Or is it the new LSI for engineers? Well, possibly both, we're not sure yet. Let's find out. To start we must ask; what in the world has made LDA the flavour of the day in SEO? That part is simple. Because the folks over at 'the Moz' brought out a new bouncing baby search tool that uses LDA modelling. Why am I on about it? Because I love this sh*t and have always said that we need to be more informed in the art of information retrieval if we're to be more complete providers. Thus I am truly vested in this conversation.
Ok, so what exactly are we looking at here? For starters we must actually get back to the world of LSI... yes, I am going there, as it must be done for context (I have generally noted SEOs' use of the term LSI to be akin to snake-oil).
Next, let's ride by some brief IR history; it became apparent at some point (2002?) that LSA/I calculations were limited and not capable of dealing with the more complex nature of the web and, in particular, large scale search engines. For more on the history of LSI and the SEO world see this post and this one. In short, its use in modern (large scale) IR was short lived. Thus came the evolution to other forms such as pLSA, HTMM and of course LDA (Latent Dirichlet Allocation). More here on (the basics) of semantic analysis in search. We have heard the term LSI bandied about for years and it's just never rung true. Let go of the LSI.... but do understand the core concepts.
Get your geek on!
Back to the history lesson. Once past the world of LSI, many (in the IR world) adopted an approach known as pLSI/A, (probabilistic Latent Semantic Indexing). Some basics (more references at end of post);
While that seemed to be working ok, there were detractors...
All of this led to the concepts surrounding LDA (and by extension; pLDA). Worth noting: LDA itself treats a document as a bag of words, so the ordering of words is not taken into account (that is something approaches such as HTMM bring to the table). One of the major benefits was said to be the ability to recognize multiple topics on a page, something other approaches struggled with (once more, related reading at the end). So, what we have is a process that indeed seems to have support and is part of the evolutionary chain.
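For the geeks who want to see the 'multiple topics on a page' idea in action, here is a minimal sketch of LDA fitted with a toy collapsed Gibbs sampler. To be clear, this is an illustration only and not how any search engine (or the Moz tool) does it: the function name `lda_gibbs`, the `alpha`/`beta` values, the iteration count and the three tiny 'pages' are all assumptions made up for the example. This toy version treats each page as a simple bag of words.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns per-document topic mixtures."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]       # words in doc d assigned to topic k
    topic_word = [dict() for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                     # total words assigned to topic k
    assign = []                                        # current topic of every word position

    # Start from a random topic assignment for every word.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assign.append(zs)

    # Repeatedly resample each word's topic given all the other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Smoothed share of each topic in each document.
    mixtures = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        mixtures.append([(doc_topic[d][t] + alpha) / denom for t in range(num_topics)])
    return mixtures

# Made-up 'pages': one about cars, one about cats, one mixing both.
docs = [
    "jaguar car engine speed car".split(),
    "jaguar cat jungle prey cat".split(),
    "car engine jungle cat jaguar speed prey".split(),
]
mixtures = lda_gibbs(docs, num_topics=2)
for mix in mixtures:
    print([round(p, 2) for p in mix])
```

Each printed row is one page's estimated split across the two topics. With a corpus this small the numbers are not meaningful in themselves; the point is simply that the model outputs a mixture per page rather than a single label.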
LDA and modern search engines
Ok cool, we have a logical path. But is this the only game in town? Can we truly say that this is at the core of Google (or any other major search engine's) relevance engine? Most certainly not. Yes, Googlers have written about LDA in the past. But they have also looked at Hidden Topic Markov Models, (although it should be noted that LDA can be used with Markov Models), pLSA and even purchased a whack of coding/patents on Phrase Based Information Retrieval.
That last one in fact, phrase based IR, is worth a closer look; they went as far as to purchase Anna Patterson's work on the topic and her company back in 2004 (she later left and started the search engine Cuil). My guess is that this approach is being used in concert with any of the latent approaches. It looks more at phrases as inferences and goes into a lot of areas beyond that including duplication, links and spam... three areas we know are important at Google (more on this in the coming weeks).
Is this strictly a Google thing? Not at all really. The fine folks at Microsoft have also funded research into LDA. Which in itself is also interesting. But the fact remains it is simply ONE particular method of semantic analysis known to the IR world. We cannot infer from anything I have seen that it alone is responsible for determining semantic relevance for any of the major engines.
Another important consideration is that the folks at Google were talking about Open HTMM well after the work on LDA was done, as noted in this 2007 Google Research post.
So, we have LDA which Google was looking at in 2003, Phrase Based in 2004 and Open HTMM, that piqued their interest in 2006/7 – take from that what you will. If you have your head wrapped around all of that, there's an interesting kicker. They do all, for the most part, play well together. Which means another option for more than one approach being used in the scoring. Especially if you say... had some new infrastructure or something. (Although my money's on 'implicit/explicit user feedback').
Beyond Semantic Relevance
Next of course we need to consider the obvious; what do such tools tell us? Well, oddly enough probably very little of a definitive nature. Even if we knew the methods involved we still wouldn't know where the dials are set. We wouldn't know what semantic analysis methods are being used in combination. And we certainly wouldn't know what weight any of this has on the overall scoring schema. One of the IR people I spoke to about it (and SEOmoz's new tool), who had been inundated with this chatter lately, told me;
This certainly does highlight the problem and as Danny (Sullivan) had said on a Sphinn thread about it;
Aye, true 'dat brother. Now we can argue what the 200+ signals are, what their weights are, but that misses the point. What matters is the fact that any signal in isolation with unknown weights in the overall scoring process, makes it all a moot point.
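To make that concrete, here is a rough sketch of why one signal on its own tells us so little. Everything in it is invented for illustration: the two pages, the three signal names, their scores and both sets of weights are assumptions, and a simple weighted sum is only a stand-in for whatever the engines actually do.

```python
# Hypothetical pages with made-up signal scores between 0 and 1.
pages = {
    "page_a": {"topic_score": 0.9, "links": 0.2, "user_feedback": 0.3},
    "page_b": {"topic_score": 0.6, "links": 0.7, "user_feedback": 0.8},
}

def rank(pages, weights):
    """Order page names by a weighted sum of their signals, best first."""
    def score(signals):
        return sum(weights[name] * value for name, value in signals.items())
    return sorted(pages, key=lambda name: score(pages[name]), reverse=True)

# Two equally plausible guesses at where the dials are set.
topic_heavy = {"topic_score": 0.8, "links": 0.1, "user_feedback": 0.1}
link_heavy = {"topic_score": 0.1, "links": 0.5, "user_feedback": 0.4}

print(rank(pages, topic_heavy))  # ['page_a', 'page_b']
print(rank(pages, link_heavy))   # ['page_b', 'page_a']
```

Same pages, same topic scores, opposite ordering. Without knowing the weights, a high score on any single signal cannot tell you which page wins.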
If someone markets a tool that can give insight into the semantic relevance for a given page, then no harm no foul. As long as it is clear that there is no magic bullet, that this is a reasonably arbitrary implementation, go for it. Long ago we learned from 'Slawski U' to speak of patents in terms of 'may be' and 'might be' or 'possibly using'. Tools are great as long as they're not over-sold.
And what of SEOmoz's newest weapon?
Upon reading the release post, I am still sitting on the fence at least. He does temper things well as far as what it may or may not be, I just have the feeling the page was aiming to be semantically relevant for 'LDA' and 'Google'. Oh wait...
Huh, maybe not. (added; this post is but a 61% score) - I have played around the tool some and oddly, had created a little spreadsheet not dissimilar to the one on the announcement post. Results? It's hard to say since I didn't fully research the deeper SERPs, but there are 'some' data points worth looking more at. But I can't call it the reverse engineering of Google's relevance models. A handy tool? Quite possibly. Just don't start wandering off track and losing sight of what it is.
Then there was this, where Rand says;
Now maybe this fellow paper/patent hound would know better; that's just instances of the terms “LDA” and “Google”. That is not 900+ actual papers by Googlers and Bingers on the topic of LDA. Some aren't even search engine related. Just a gaffe there I hope. Oui?
OK... enough of that, I'm being cheeky there :0) I most certainly can do little but applaud Rand and Co. at the end of this ride. Why? Because it has shone yet another light on what 2/3rds of the initialism is, Search Engine(s). It has spurred many days of uber geeky chatter in the search space which is ultimately a GREAT thing. We must be able to understand these concepts to ensure we can make realistic observations in such situations in the future.
The real kicker? Is this;
Ok my brother... as long as we have that, we be cool.
It is great to see this level of interest and engagement around these topics. Makes an ol' (fire) horse feel all giddy inside. The Moz gang most certainly 'get it'. At least far better than a large portion of search folks I come across. Who can complain about these kinds of convos? Not me... not me.
Until next time, play safe!
ADDED; as for some of the correlation data that the Moz used, I'd have a read of Sean Golliher's post and (Dr.) Edel Garcia's for some counter points on that end.
Hidden Topic Markov Models
Google News and PLSA - Google News Personalization: Scalable Online Collaborative filtering
Phrase Based IR
Here on the Trail...
Last Updated on Thursday, 07 October 2010 23:33