The new LSI for 3rd Generation SEOs?
Or is it the new LSI for engineers? Well, possibly both, we’re not sure yet. Let’s find out. To start we must ask: what in the world has made LDA the flavour of the day in SEO? That part is simple: the folks over at ‘the Moz’ brought out a new bouncing baby search tool that uses LDA modelling. Why am I on about it? Because I love this sh*t and have always said that we need to be better informed in the art of information retrieval if we’re to be more complete providers. Thus I am truly invested in this conversation.
Ok, so what exactly are we looking at here? For starters we must actually get back to the world of LSI… yes, I am going there, as it must be done for context (I have generally found SEOs’ use of the term LSI to be akin to snake oil).
Next, let’s ride by some brief IR history: it became apparent at some point (2002?) that LSA/I calculations were limited and not capable of dealing with the more complex nature of the web and, in particular, large scale search engines. For more on the history of LSI and the SEO world see this post and this one. In short, its use in modern (large scale) IR was short lived. Thus came the evolution to other forms such as pLSA, HTMM and of course LDA (Latent Dirichlet Allocation). More here on (the basics of) semantic analysis in search. We have heard the term LSI bandied about for years and it’s just never rung true. Let go of the LSI… but do understand the core concepts.
Get your geek on!
Back to the history lesson. Once past the world of LSI, many (in the IR world) adopted an approach known as pLSI/A (probabilistic Latent Semantic Indexing/Analysis). Some basics (more references at end of post):
“…. models each word in a document as a sample from a mixture model, where the mixture components are multi-nomial random variables that can be viewed as representations of “topics.” Thus each word is generated from a single topic, and different words in a document may be generated from different topics. Each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics. This distribution is the “reduced description” associated with the document.”
While that seemed to be working ok, there were detractors…
“….. it is incomplete in that it provides no probabilistic model at the level of documents. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set.”
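To put rough numbers on point (1): the LDA paper listed at the end of this post counts kV + kM parameters for a k-topic pLSI model (V vocabulary words, M training documents), against k + kV for LDA. Here is a quick sketch of that arithmetic. The corpus sizes are made-up figures for illustration only, and the function names are mine.

```python
def plsi_param_count(num_topics, vocab_size, num_docs):
    """pLSI: topic-word tables (k*V) plus one topic mixture per training doc (k*M)."""
    return num_topics * vocab_size + num_topics * num_docs


def lda_param_count(num_topics, vocab_size):
    """LDA: one Dirichlet prior (k values) plus the topic-word tables (k*V)."""
    return num_topics + num_topics * vocab_size


# Hypothetical sizes: 100 topics, 50,000 words, and a growing corpus.
for docs in (10_000, 1_000_000):
    print(docs, plsi_param_count(100, 50_000, docs), lda_param_count(100, 50_000))
```

The pLSI figure climbs with every document added, while the LDA figure never mentions the corpus size at all. That is the overfitting complaint in a nutshell.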
All of this led to the concepts surrounding LDA (and by extension, pLDA). Note that, like pLSI, LDA treats a document as a “bag of words”: the ordering of words is not taken into account. What it adds is a generative probabilistic model at the level of documents, which addresses the two problems quoted above. One of the major benefits was said to be the ability to recognize multiple topics on a page, which other approaches struggled with (once more, related reading at the end). So, what we have is a process that indeed seems to have support and is part of the evolutionary chain.
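For the geeks, here is a minimal sketch of the generative story LDA assumes, following the Blei et al. paper in the reading list: draw a topic mixture for the document from a Dirichlet prior, then for each word pick a topic from that mixture and a word from that topic. The two toy topics, their word probabilities and the alpha values are all invented for illustration. This is a teaching toy, not how any search engine (or tool) implements it.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Generate one document under the LDA generative story.

    topics: one dict per topic, mapping word -> probability
    alpha:  Dirichlet prior over topic proportions (one value per topic)
    """
    theta = sample_dirichlet(alpha, rng)  # this document's topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word; word position plays no role in the draw.
        z = rng.choices(range(len(topics)), weights=theta, k=1)[0]
        vocab = list(topics[z])
        word = rng.choices(vocab, weights=[topics[z][v] for v in vocab], k=1)[0]
        words.append(word)
    return theta, words


# Hypothetical toy topics, made up for this example.
TOPICS = [
    {"links": 0.5, "anchor": 0.3, "crawl": 0.2},     # an "SEO" flavoured topic
    {"topic": 0.4, "dirichlet": 0.3, "prior": 0.3},  # a "modelling" flavoured topic
]

rng = random.Random(0)
theta, doc = generate_document(TOPICS, alpha=[0.5, 0.5], length=12, rng=rng)
print(theta)
print(doc)
```

Fitting LDA means running this story in reverse: given only the words, infer the topics and the per-document mixtures. The mixture `theta` is what lets one page carry several topics at once.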
LDA and modern search engines
Ok cool, we have a logical path. But is this the only game in town? Can we truly say that this is at the core of Google’s (or any other major search engine’s) relevance engine? Most certainly not. Yes, Googlers have written about LDA in the past. But they have also looked at Hidden Topic Markov Models (although it should be noted that LDA can be used with Markov Models) and pLSA, and even purchased a whack of code/patents on Phrase Based Information Retrieval.
As for that last one, phrase based IR, they went as far as to purchase Anna Patterson’s work on the topic and her company back in 2004 (she later left and started the search engine Cuil). My guess is that this approach is used in concert with any of the latent approaches. It looks more at phrases as inferences and goes into a lot of areas beyond that, including duplication, links and spam… three areas we know are important at Google (more on this in the coming weeks).
Is this strictly a Google thing? Not at all, really. The fine folks at Microsoft have also funded research into LDA, which in itself is also interesting. But the fact remains it is simply ONE particular method of semantic analysis known to the IR world. We cannot infer from anything I have seen that it alone is responsible for determining semantic relevance for any of the major engines.
Another important consideration is that the folks at Google were talking about Open HTMM well after the work on LDA was done, as noted in this 2007 Google Research post.
“It differs notably from others in that, rather than treat each document as a single “bag of words,” it imposes a temporal Markov structure on the document. In this way, it is able to account for shifting topics within a document, and in so doing, provides a topic segmentation within the document, and also seems to effectively distinguish among multiple senses that the same word may have in different contexts within the same document.” – Google Research Blog
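A rough sketch of what that “temporal Markov structure” can look like. As a simplification of the HTMM paper in the reading list, assume every word in a sentence shares one topic, and each new sentence either keeps the previous sentence's topic or, with probability epsilon, redraws one from the document's topic mixture. The function name and the numbers are mine, and this is not the Open HTMM code.

```python
import random


def segment_topics(num_sentences, theta, epsilon, rng):
    """Assign one topic per sentence using a 'sticky' Markov chain.

    num_sentences: number of sentences in the document (at least 1)
    theta:         the document's topic mixture (probabilities per topic)
    epsilon:       chance that a new sentence redraws its topic from theta
    """
    topic_ids = range(len(theta))
    current = rng.choices(topic_ids, weights=theta, k=1)[0]
    assignments = [current]
    for _ in range(num_sentences - 1):
        if rng.random() < epsilon:
            # Topic shift: draw a fresh topic (which may repeat the old one).
            current = rng.choices(topic_ids, weights=theta, k=1)[0]
        assignments.append(current)
    return assignments


# Hypothetical document: 10 sentences, 3 topics, a 20% chance of shifting.
print(segment_topics(10, [0.5, 0.3, 0.2], 0.2, random.Random(42)))
```

Runs of the same topic id are the “topic segmentation” the quote mentions. With epsilon at 0 the whole document sticks to one topic; with epsilon at 1 every sentence is drawn independently, which is much closer to the bag-of-words picture.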
So, we have LDA, which Google was looking at in 2003, Phrase Based IR in 2004 and Open HTMM, which piqued their interest in 2006/7; take from that what you will. If you have your head wrapped around all of that, there’s an interesting kicker: they do all, for the most part, play well together. That leaves open the option of more than one approach being used in the scoring. Especially if you, say… had some new infrastructure or something. (Although my money’s on ‘implicit/explicit user feedback’).
Beyond Semantic Relevance
Next of course we need to consider the obvious: what do such tools tell us? Well, oddly enough, probably very little of a definitive nature. Even if we knew the methods involved we still wouldn’t know where the dials are set. We wouldn’t know what semantic analysis methods are being used in combination. And we certainly wouldn’t know what weight any of this has on the overall scoring schema. One of the IR people I spoke to about it (and SEOmoz’s new tool), who had been inundated with this chatter lately, told me:
“(re; LDA) It’s actually sorta similar to PLSI and yes, impossible to say how or why or where it is used. It’s a well known, old method. It is taught to undergrads in data mining. I think (ed – SEOmoz’s) Ben has good dev skills and has researched and looked at something interesting, but I don’t see how it helps SEO. ……
There is tons of this code available, you don’t need some special tool or software. This is yet another fire that will rush through the SEO community and we’ll be hearing about it for years. LDA is interesting but in relation to SEO, Boring 🙂 — Unless of course I have failed to see the light and this is indeed how Google works. ”
This certainly does highlight the problem, and as Danny (Sullivan) said on a Sphinn thread about it:
“I did get the “content relevancy” part. But given that links far outweigh on the page content, so what? Google has over 200 different factors. Page titles. Content on the page. Use of bold. Links to a page. Quality of those links. Age of domain. Trust of that domain. The linkage and trust of a domain continues to be what most people find trumps every other factor.”
Aye, true ‘dat brother. Now we can argue what the 200+ signals are, what their weights are, but that misses the point. What matters is the fact that looking at any signal in isolation, with unknown weights in the overall scoring process, makes it all a moot point.
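A toy illustration of why the unknown weights matter. Everything here is invented: two hypothetical pages, two signals and two sets of weights, combined with a plain weighted sum (real scoring is certainly not this simple).

```python
def combined_score(signals, weights):
    """Weighted sum of a page's signal values."""
    return sum(weights[name] * value for name, value in signals.items())


def rank(pages, weights):
    """Return page names ordered from highest to lowest combined score."""
    return sorted(pages, key=lambda p: combined_score(pages[p], weights), reverse=True)


# Hypothetical pages: page_a wins on topic relevance, page_b wins on links.
pages = {
    "page_a": {"topic_relevance": 0.90, "links": 0.20},
    "page_b": {"topic_relevance": 0.55, "links": 0.80},
}

content_heavy = {"topic_relevance": 0.8, "links": 0.2}
link_heavy = {"topic_relevance": 0.2, "links": 0.8}

print(rank(pages, content_heavy))  # page_a first
print(rank(pages, link_heavy))     # page_b first
```

Same relevance scores, opposite rankings. Knowing one signal's value without its weight says little about where a page lands.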
If someone markets a tool that can give insight into the semantic relevance for a given page, then no harm no foul. As long as it is clear that there is no magic bullet, that this is a reasonably arbitrary implementation, go for it. Long ago we learned from ‘Slawski U’ to speak of patents in terms of ‘may be’ and ‘might be’ or ‘possibly using’. Tools are great as long as they’re not over-sold.
And what of SEOmoz’s newest weapon?
Upon reading the release post, I am still sitting on the fence, at least. The post does temper things well as far as what the tool may or may not be; I just have the feeling the page was aiming to be semantically relevant for ‘LDA’ and ‘Google’. Oh wait…
‘Relevance of “www.seomoz.org/blog/lda-and-googles-rankings-well-correlated” to “google and lda”: 55% ‘
Huh, maybe not. (Added: this post is but a 61% score.) I have played around with the tool some and, oddly, had created a little spreadsheet not dissimilar to the one in the announcement post. Results? It’s hard to say since I didn’t fully research the deeper SERPs, but there are ‘some’ data points worth looking more at. But I can’t call it the reverse engineering of Google’s relevance models. A handy tool? Quite possibly. Just don’t start wandering off track and losing sight of what it is.
Then there was this, where Rand says:
“There are hundreds of papers from Google and Microsoft (Bing) researchers around LDA-related topics, too, for those interested. Reading through some of these, you can see that major search engines have almost certainly built more advanced models to handle this problem. ”
Now maybe this fellow paper/patent hound would know better: that’s just instances of the terms “LDA” and “Google”. That is not 900+ actual papers by Googlers and Bingers on the topic of LDA. Some aren’t even search engine related. Just a gaffe there, I hope. Oui?
OK… enough of that, I’m being cheeky there :0) I most certainly can do little but applaud Rand and Co. at the end of this ride. Why? Because it has shone yet another light on what 2/3rds of the initialism is: Search Engine(s). It has spurred many days of uber geeky chatter in the search space, which is ultimately a GREAT thing. We must be able to understand these concepts to ensure we can make realistic observations in such situations in the future.
The real kicker is this:
“…this is not us “reversing the algorithm.” We may have built a great tool for improving the relevancy of your pages and helping to judge whether topic modeling is another component in the rankings” – Rand Fishkin
Ok my brother… as long as we have that, we be cool.
It is great to see this level of interest and engagement around these topics. Makes an ol’ (fire) horse feel all giddy inside. The Moz gang most certainly ‘get it’. At least far better than a large portion of search folks I come across. Who can complain about these kinds of convos? Not me… not me.
Until next time, play safe!
ADDED: as for some of the correlation data that the Moz used, I’d have a read of Sean Golliher’s post and (Dr.) Edel Garcia’s for some counterpoints on that end.
- Latent Dirichlet Allocation, Blei et al., JMLR (3), 2003.
- pLDA on Google Code
- Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation, Porteous et al., KDD 2008.
- Distributed Inference for Latent Dirichlet Allocation, Newman et al., NIPS 2007.
- Collaborative Filtering for Orkut Communities: Discovery of User Latent Behavior. Wen-Yen Chen et al., WWW 2009.
- PLDA: Parallel Latent Dirichlet Allocation for Large-scale Applications – (direct link)
- Code – A quick start manual for pLDA
Hidden Topic Markov Models
- Hidden Topic Markov Models – Hebrew University of Jerusalem
- Open HTMM on Google Code
- Amit’s talk on HTMM (video) – Google Tech Talks
- oHTMM Announcement on Google Research Blog
Google News and PLSA – Google News Personalization: Scalable Online Collaborative filtering
Phrase Based IR
Here on the Trail…
- What you need to know about phrase based optimization
- Phrase Based Optimization Resources
- Google granted a very Cuil patent
- Phrase-based personalization of searches in an information retrieval system
- Phrase Identification in an Information Retrieval System
- Phrase-Based Generation of Document Descriptions
- Phrase-Based Searching in an Information Retrieval System
- Automatic Taxonomy Generation in Search Results Using Phrases
- Phrase-based indexing in an information retrieval system
- Multiple index based information retrieval system
- Detecting spam documents in a phrase based information retrieval system