A Google ‘History’ Lesson
If there is one area that all SEOs really should know about, it’s how search engines deal with ‘document freshness’. In the world of Google, we’d know this in terms of real-time/social search and, of course, the not so often discussed ‘query deserves freshness’ (a.k.a. the QDF). We all know the interest in timely content in the SERPs, but does it end there? Is it all about first to market? Not at all… so let’s take a close look.
What is the QDF all about? – essentially what it comes down to is a small glitch in link-based valuation systems such as Google’s PageRank. In a lot of cases, most actually, a newer document is going to have fewer links than an older one. This, as you might imagine, can be a problem in some query spaces (especially entertainment and sports). Essentially, a QDF approach seeks to help deal with this shortcoming. But I am getting ahead of myself; first a little history (pun most certainly intended).
While going through the latest patent awards recently, I came across a familiar Google offering;
System and methods for determining document freshness
Filed; June 30 2004. Awarded; Sept 14 2010.
It was interesting as it was previously filed/awarded and this was its second appearance on the radar. Previously we looked at its sister document, also a re-release, back in 2008. It was curious that these both had been re-worked. Ok, fine, maybe that part is only interesting to me. A distinct possibility. We can get back to that later.
This became more interesting in that, as this post neared completion, Ol Matt mentioned one of the series in a GWC video. (He worked on one of the patents in this line – links at the end.)
So hey, now that you’re warmed up, let’s take this opportunity to look once more at historical elements in Google search. Peek under the hood and see what is in there.
How temporal data plays into search
At first glance I am sure you might be thinking, ooooooh this is gonna be about Twitter and realtime and Caffeine, right? But that’s not the real deal here. Yes, the ideas have been adapted, but this ride started in 2004, well before the current social onslaught. Let us first take a refresher on what types of data they might be looking for.
- document inception dates;
- document content updates/changes;
- query analysis;
- link-based criteria;
- anchor text;
- user behaviour;
- domain-related information;
- ranking history;
- user maintained/generated data (e.g., bookmarks and/or favorites);
- unique words, bigrams, and phrases in anchor text;
- linkage of independent peers;
- and/or document topics.
What is important to note is that beyond the simple historical publication knowledge, they might also use these types of signals to better understand links, fight webspam and even for personalization. It is certainly no one trick pony.
As an example, let us consider #3 and #8 from above (Query & Ranking data). For one potential scoring element, we might look at the query history of a term, the ranking history for your page in that space and related click data. We might then ask;
- What is this page’s historical rankings vs. CTR data?
- How does this match up against other pages returned (over time)?
- What other queries does it show up for? (topic relation/diversity)
- Has this page had a significant ranking increase recently? (spam detection)
- And if the above is ‘yes’, then does the associated query data show quality? (spam detection)
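To make the last two questions a bit more concrete, here is a minimal sketch in Python. To be clear, this is my own toy illustration and not anything lifted from the patent: the `Snapshot` record, the `looks_suspicious` function and both thresholds (`jump`, `min_ctr`) are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """One periodic observation of a page for a single query (hypothetical)."""
    rank: int    # position in the results, 1 is best
    ctr: float   # click-through rate at that position, 0.0 to 1.0


def looks_suspicious(history, jump=10, min_ctr=0.02):
    """Flag a page whose rank improved sharply while its CTR stayed poor.

    `jump` and `min_ctr` are invented thresholds, for illustration only.
    """
    if len(history) < 2:
        return False  # not enough history to compare
    previous, latest = history[-2], history[-1]
    improved_by = previous.rank - latest.rank  # positive means it moved up
    return improved_by >= jump and latest.ctr < min_ctr


# A page that leaps from 45th to 4th but nobody clicks on it
rocket = [Snapshot(rank=45, ctr=0.001), Snapshot(rank=4, ctr=0.005)]
# The same leap, but searchers clearly like what they find
deserved = [Snapshot(rank=45, ctr=0.001), Snapshot(rank=4, ctr=0.12)]

print(looks_suspicious(rocket), looks_suspicious(deserved))
```

The point is only that a ranking jump on its own says little; it is the pairing with click quality that starts to tell a story.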
But Dave, WAIT! – I hear you say? That this model of simple collection is totally spammable? Most certainly. Maybe we latch on to #4 (link-based) and cross reference some data to ensure it’s a good match. Obviously link velocity and decay would tell a story as well. You see where we’re headed here?
The main point is to always consider how a group of signals such as these might interact together. It also highlights how fast, with so many systems, one can get up to the mystical 200+ (or so? or 300? meh) ranking factors at Google.
Determining freshness of a page
Which of course brings us back to the recent (re)award. How exactly does Google try to determine what is and isn’t new out there? To expand on Bill’s original round up, items of interest in determining freshness include;
- When it is first crawled by the search engine; a no brainer. One signal can be when the page was first discovered by ol Google.
- When it is first submitted to the search engine; this one might be just a bit dated, so we can pass on this one.
- When the search engine first discovers a link to the document; another given. Using this as a signal is certainly up Google’s alley. The ‘if it ain’t got a link, it don’t exist’ approach.
- When the domain was registered; certainly a signal that can be useful for domain scoring. Not so much on the document level.
- When the page was first referenced in another document; once more, like links, non-link citations might be of use.
- When a document first reaches a certain number of pages; expanding on links and citations, setting a velocity metric.
- By the time stamp of the document on the server it is hosted upon; more obscure, but still a possible signal.
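How might a pile of dates like these be boiled down to one ‘inception date’? A minimal sketch, assuming the simplest rule I can think of: take the earliest date we actually have. The signal names and the `estimate_inception` function are hypothetical, and I have left domain registration out since (as noted above) it is more of a domain-level signal.

```python
from datetime import date


def estimate_inception(signals):
    """Return (signal_name, date) for the earliest known date, or None.

    `signals` maps a hypothetical signal name to a date, or to None
    when that signal is not available for the document.
    """
    known = {name: d for name, d in signals.items() if d is not None}
    if not known:
        return None
    name = min(known, key=known.get)  # the signal holding the earliest date
    return name, known[name]


page = {
    "first_crawled": date(2010, 9, 20),
    "first_link_seen": date(2010, 9, 16),
    "first_cited_elsewhere": None,  # no non-link citation found yet
    "server_timestamp": date(2010, 9, 18),
}

print(estimate_inception(page))
```

A real system would presumably weigh how trustworthy each source is (a server time stamp is easy to fake, for one), which this sketch ignores entirely.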
As with any approach we have to consider other elements and inherent scoring mechanisms that may be in play. From what we can see here, there is certainly a likely combination of discovery, crawling (internal site equity), citations and links (velocity). Keep in mind, this is about temporal data. This is about discovery. It doesn’t ensure a ‘fresh’ page is indexed, never mind ranking for anything of meaning.
The Social Connection
And of course we’re all left to wonder; how does social play into this? Right? Of course. We live in a world of wonders (for the old dog SEOs) where getting the content out there is easier than ever. Thanks to the social web. If you’re looking to break the latest news, be first to market with a product, launch a new service, social can rock.
What’s great about it all is these sites often have some pretty good link equity (Twitter = lots of links in + very few going out). This of course means they are heavily crawled, as fresh content + equity means minty fresh crawl rates.
But not only can we increase the temporal discovery rates, we can also develop (non-link) citation velocity and even hit aspects related to user behaviour. Yes, in theory, this system can use the Google Social Graph to watch various types of user interactions for search personalization.
There is even mention of explicit feedback with ‘user generated data’ (bookmarks etc.). Personalization is an interesting potential use for this… Anyway, yes, there are certainly some ways it can be used for social (makes you wonder why Google’s realtime search isn’t better… huh…)
Other handy SEO tips
At the end of the day, we want to learn something from all of this. Before we get to some tips, let us consider again the QDF. It is important to understand that some query types/spaces are going to react differently than others. Sometimes it can be obvious (historical documents, past events, etc. that haven’t had fresh activity). Other times, not so much. This will be part of the art of SEO; getting intimate with a query space to understand the dynamic.
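To picture how a query space could change the outcome, here is a toy sketch of a QDF-style blend. It is purely my own assumption of how such a thing might look, not Google’s formula: `blended_score`, the 0 to 1 `query_freshness` dial and the seven day half-life are all invented, and `link_score` is assumed to be normalised between 0 and 1.

```python
def blended_score(link_score, age_days, query_freshness, half_life_days=7.0):
    """Blend a link-based score with a recency boost.

    query_freshness: 0.0 (evergreen query) to 1.0 (breaking-news query).
    half_life_days: how quickly the recency boost fades; an invented default.
    """
    # 1.0 for a brand new page, halving with every half-life that passes
    recency = 0.5 ** (age_days / half_life_days)
    return (1.0 - query_freshness) * link_score + query_freshness * recency


# An old, well-linked page against a day-old page with few links
for label, freshness in (("evergreen", 0.1), ("breaking", 0.8)):
    old_page = blended_score(link_score=0.9, age_days=400, query_freshness=freshness)
    new_page = blended_score(link_score=0.2, age_days=1, query_freshness=freshness)
    print(label, "old wins" if old_page > new_page else "new wins")
```

With these made-up numbers the established page holds the evergreen query while the newcomer takes the breaking one, which is the whole idea in miniature.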
Some spaces are more prone to the QDF than others – know the differences. Beyond that, here are some simple take-away items from this or any related approach;
Content Strategy and scheduling – you really do need to have a more formal content program in place to properly take advantage of temporal signals. This also means supporting content (to work on indented and implicit domain SERP listings). Target the query spaces with military precision.
Content update schedule – if you have target pages that are past their prime (for QDF and social), ensure that there is a quarterly page update. Older pages don’t generally have as much activity and so small changes can be a signal to the search engine that the page is still viable. It takes a few moments and is good for users as well.
Time out your promotions – as we learned, citations, non-link ones even, can be of value. Build out velocity of PR and Social activities in tandem with SEO related goals. This will obviously increase the chances of not only grabbing SERP ground, but holding it.
Spread out link building efforts (link velocity and decay) – as with above, we will also want to ensure we support the tail end of the buzz with more (stable) link building to hold any ground taken (rankings). Also be wary of acquisition that is too fast; it may trigger spam bots to come have a close look.
Enable syndication channels – did we mention social? Well, let’s go beyond that. At the core of social is the ability to have feeds. This is by and large the way of the web these days, so be sure to spread the word by getting your feeds in strong locales. This will aid in discovery, citations and of course, the potential for QDF inclusion. Push baby…PUSH!
Stay on top of new developments – if we’re targeting the QDF aspects, be sure to have an ear to the ground and be as close (and comprehensive) to any breaking stories as possible. Then use some of the velocity tips above to hold the ground you’ve taken.
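On the link velocity tip above, a quick way to keep yourself honest is to watch your own acquisition rate. A minimal sketch, assuming you log new links per week; the `velocity_spikes` function, the four week window and the 3x factor are illustrative guesses of mine, not known thresholds of any search engine.

```python
def velocity_spikes(weekly_new_links, window=4, factor=3.0):
    """Return indexes of weeks whose new-link count far exceeds the recent average.

    `window` and `factor` are illustrative values, not known thresholds.
    """
    spikes = []
    for i in range(window, len(weekly_new_links)):
        recent = weekly_new_links[i - window:i]
        baseline = sum(recent) / window  # average of the preceding weeks
        if baseline > 0 and weekly_new_links[i] > factor * baseline:
            spikes.append(i)
    return spikes


steady = [5, 6, 7, 8, 9, 10, 11, 12]  # gradual growth
burst = [5, 6, 5, 6, 60, 4, 5, 6]     # one big week, then back to normal

print(velocity_spikes(steady), velocity_spikes(burst))
```

The burst pattern is the one to think twice about: a spike followed by nothing looks a lot less natural than buzz that tapers into steady link building.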
More than meets the eye
As always, I am passing along these tidbits to get the gears in motion. To help SEOs think beyond the myopic thinking of links, links, links… Yes, this area, as with most, has link relations, but here one can see the rest of the forest as well. Take all that we can learn from this and put it back into the SEO stew. A deeper understanding of temporal elements in search can only help you build smarter programs.
I hope you enjoyed the ride… leave any questions/thoughts in the comments and we can continue the dialogue.
Past Google Patents of Interest;
- Information retrieval based on historical data (filed Dec 2003, published March 2008)
- Systems and methods for determining document freshness (filed June 2003, published June 2005)
- Document scoring based on document inception date (including our pal Matt Cutts; filed Nov 2006, published April 2007)
- Document scoring based on document content update (filed Nov 2006, published April 2007)
- Document scoring based on query analysis (filed Nov 2006, published April 2007)
- Document scoring based on traffic associated with a document (filed Nov 2006, published April 2007)
- Document scoring based on link-based criteria (filed Nov 2006, published April 2007)