Are you ready for the next Penguin assault?
Written by David Harry
Monday, 20 August 2012 12:44
According to comments being reported from SES, Matt Cutts has said the next incarnation is "the father of all Penguins" - "jarring and jolting" - and that SEOs won't "want the next Penguin update." Strange, considering many of the ones I know didn't want the first one. Anyway....
Accordingly, many search types are scurrying around talking about it on various communities and social sites. Even my Skype fired up as folks started asking for my take. Not sure I really have one beyond the same mantra I've preached for years on this kinda crap (link spam etc).
Wait. Wasn't this the 'web spam' update?
Back when we first got to know this destructive little flightless fowl, it was called 'the web spam' update. That's actually kind of important. This is NOT part and parcel of Panda, other than sharing the same project name (Search Quality).
They went on to note,
As such, if we truly want to get a sense of what has been happening or what may be on the horizon, we need to get a bit more of an understanding of various tactics search engines use to combat SEOs... erm... I mean web spam.
Boosting and hiding
In simplest terms we can break it down into two camps: boosting techniques and hiding techniques (from my guide to webspam).
Boosting is just like it sounds, and some examples are:
Hiding is more about tricking the engines with various on-site tactics. Some of these include:
Getting the idea? I'd never advise taking one's eye off the ball as far as looking for simple answers. If anything, that's what got a lot of SEOs into the dung heap to start with. But I digress...
Potential On Site Penguin Issues
Given that these ones are seemingly less the focus for Google, we'll just look at a few that I haven't seen mentioned a whole lot, that might actually be part of the algo.
Language: they might treat different languages on different levels. Research has shown that French, German and English tend to have higher levels of spam. There could be trust elements to the updates.
Top Level Domain: domains such as .INFO and .BIZ traditionally have higher levels of spam. This could lower a trust score for these.
Words per page: apparently the sweet spot for spammers is 750-1500 words. Could there be a classifier to look at this?
Keywords in page TITLE: a classic boosting technique and research has found spam pages contain far more keywords than non-spammy (classified) pages.
Amount of anchor text: they might look at the ratios of text to anchor text. Interestingly, that approach could be Panda related as well.
Compressibility: as a mechanism used to fight KW stuffing, search engines can also look at compression ratios. More specifically, these can flag repetitious or spun content.
Host-level spam: looking at other domains on the server and/or registrar levels. One would have to imagine that easily found networks are certainly on the table.
Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.
Outgoing links: a website might link out to well-known pages seeking to raise its 'hub score' (see TrustRank concepts earlier). Although for any use of this, I'd imagine a low threshold to deal with false positives (think of sites that scrape entire sections of Wikipedia).
And most certainly one can look around their CMS to ensure they're not cloaking, sending odd redirects, hiding text and so on. Obviously we'd never do that knowingly right? That's what I thought.
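To make the compressibility idea a little more concrete, here's a minimal sketch of what a compression-ratio check could look like. To be clear, this is my illustration and not anything Google has published: the `compression_ratio` function and the sample strings are made up, and I'm assuming plain zlib as the compressor.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by zlib-compressed size.

    Higher values mean the text is more repetitive. Hypothetical helper,
    not a documented search engine signal.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Made-up samples: one keyword-stuffed block, one ordinary paragraph.
stuffed = "cheap blue widgets buy cheap blue widgets online " * 40
varied = (
    "Penguins are flightless birds that live mostly in the Southern "
    "Hemisphere. They spend about half their lives on land and the rest "
    "in the sea, feeding on krill, fish and squid caught while swimming."
)

print(round(compression_ratio(stuffed), 2))
print(round(compression_ratio(varied), 2))
```

The stuffed sample squashes down far more than the ordinary paragraph, which is the whole point: repetition compresses well. Where an engine would actually draw the line is anyone's guess.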
Potential Linking Penguin Issues
Certainly this area is the one getting the most attention after the first two rounds. And I don't mind that; it's just that we need to think beyond popular theories (anchor text comes to mind). So let's look at some things that might be in the mix:
TrustRank: better known as neighbourhoods (more here). Good sites link to good sites, spam sites (generally) link to other spammy (feeder) sites. Trust in general, does seem to be involved in the Penguin evolution. (also see harmonic rank)
Link stuffing: while it can be used on-site, it is the concept of creating tons of pages that have a link pointing to a given target page. This may be site-wides, or multiple domains as well as on-site. In fact, to a degree the practice of low level directories for SEO could play here as could forums, link spam, widgets and infographics.
Nepotistic links: the well known usual suspects such as paid links, link exchanges and their ilk. We certainly do know that Google isn't much of a fan of this type of approach. We can surely go out on a limb and infer many of these types of link spam are in consideration.
Topological spamming (link farms): search engines will often compare % of links in the graph against known entities ('good sites'). Do you have a disproportionate number of links compared to those in your query space(s)? It may be an issue.
Temporal anomalies: better known as 'link velocity' and 'link decay' to most. Again, when looking at relative pages in the index, spam pages will generally stand out. Those manufacturing links will have a different graph than those considered 'good pages' to Google.
Anchor text spam: it's no secret that those trying to manipulate place a high value here. As with other thresholds mentioned, this can be compared to other sites in the query space(s) considered 'good sites' as part of a seed set. This one has certainly seen play since Penguin launched.
Expired domains: when the spammer buys expiring domains that have link equity and points them at the target site(s). Or simply replaces the content on the domain to take advantage of the existing equity.
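For the geeks, the TrustRank idea from the list above can be sketched as a PageRank variant where the 'random jump' always lands on a hand-picked seed set of good pages. This is a toy version under my own assumptions (the `trust_rank` function, the damping value and the little graph are all invented for illustration); it is not how any engine actually implements it.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping page -> list of pages it links to
    seeds: set of hand-picked 'good' pages (hypothetical seed set)
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Trust is injected only at the seeds, split evenly between them.
    seed_share = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_share)
    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_share[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Each page passes a damped, equal slice of its trust to its links.
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Made-up graph: a small 'good' neighbourhood and a spam cluster
# feeding a target page that no good page links to.
graph = {
    "good-a": ["good-b", "good-c"],
    "good-b": ["good-c"],
    "good-c": ["good-a"],
    "spam-a": ["spam-b", "target"],
    "spam-b": ["spam-a", "target"],
    "target": [],
}
scores = trust_rank(graph, seeds={"good-a"})
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the spam cluster and its target never receive any trust, because nothing in the good neighbourhood links to them. That's the 'good sites link to good sites' notion in a nutshell.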
A Matter of Levels
Right then, enough of the geekery. The main goal here was to start to think beyond the everyday. Stop trying to nail down what Penguin is (or will become) with things like 'anchor text ratios' and 'networks'. If this truly is a web spam update, then there's a lot more on the table for Cutts and Co to chew on.
I like to consider much of this in terms of thresholds. It's a common theme. We might consider:
Given that we have (for now) a definitive line between Penguin and manual actions (Webmaster Tools messages), I would also imagine that Penguin itself may still have relatively lower thresholds.
Whatever does happen next, most SEOs that have been paying attention should be fine.
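If it helps to picture a threshold, here's a rough sketch of comparing one site against the 'good sites' in its query space. Everything in it is assumed for the sake of illustration: the `exceeds_threshold` function, the standard-deviation cutoff and the numbers are all mine, and nobody outside Google knows the real signals or levels.

```python
from statistics import mean, pstdev


def exceeds_threshold(value, peer_values, max_z=2.0):
    """True when value sits more than max_z standard deviations above the peer mean.

    Hypothetical check; max_z is an arbitrary cutoff chosen for the example.
    """
    mu = mean(peer_values)
    sigma = pstdev(peer_values)
    if sigma == 0:
        return value > mu
    return (value - mu) / sigma > max_z


# Made-up share of exact-match anchor text for trusted sites in one query space.
good_sites = [0.05, 0.08, 0.10, 0.07, 0.06]

print(exceeds_threshold(0.09, good_sites))
print(exceeds_threshold(0.60, good_sites))
```

A site a touch above its peers passes, while one that is wildly out of line with the rest of the query space gets flagged. The same shape of check could apply to link velocity, link counts or any of the other signals above.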
Man, I kill myself.... weeeeeeeeeeeeeee
ADDED: I have also published a post on Search Metrics to help you diagnose a Google Penguin problem. Thought it worth sharing here as well.
Seriously, I've been saying it for years... but REALLY, read a few of these. Realize just how much there is to this. It wasn't the 'anchor text' update, it's a web spam update. Aight?
Web Spam Research Papers
Last Updated on Monday, 27 August 2012 13:36