Are you ready for the next Penguin assault?
Written by David Harry
Monday, 20 August 2012 04:44

According to comments being reported from SES, Matt Cutts has said the next incarnation is "the father of all Penguins" - "jarring and jolting" - and that SEOs won't "want the next Penguin update." Strange, considering many of them I know didn't want the first one. Anyway....

Accordingly many search types are scurrying around talking about it on various communities and social sites. Even my Skype fired up as folks started asking for my take. Not sure I really have one beyond the same mantra I've preached for years on this kinda crap (link spam etc).

Wait. Wasn't this the 'web spam' update?

Back when we first got to know this destructive little flightless fowl, it was called 'the web spam' update. That's actually kind of important. This is NOT part and parcel of Panda, other than sharing the same project name (Search Quality).

Here's how SEO was defined in one Stanford paper;

“(...) any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page's true value.” (from Web Spam Taxonomy, Stanford - PDF)

They went on to note,

"Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming." (emphasis mine)

As such, if we truly want to get a sense of what has been happening or what may be on the horizon, we need to get a bit more of an understanding of various tactics search engines use to combat SEOs... erm... I mean web spam.

Boosting and hiding

In simplest terms we can break it down into two camps; Boosting techniques, and Hiding techniques, (from my guide to webspam);

Boosting, is just like it sounds and some examples are;

  • Term Spamming: This would be those seeking to manipulate through elements such as the page TITLE (title spam), Meta Description or Meta Keywords (meta spam). As most of us know, two out of three of those were abused to the point where most modern search engines don't use them as signals at all.

  • URL Spamming is another area they've been known to also look at. Yup, strange as it sounds, because there is some weight given to URLs by some search engines, it can be considered to be a manipulation.

  • Link Spamming is another well-known one that also includes anchor text spamming. Search engines consider not only the mass of link spam, but also the anchor text, as this is one of the more important signals from a ranking perspective. This section obviously also includes when spammers seek to drop links on pages to increase a target page's value (forums, comments, guest books, etc.) and obviously the more nefarious hack and drop techniques.

Hiding is more about tricking the engines with various on-site tactics. Some of these include;

  • Content hiding: These are techniques where terms and links are hidden when the browser renders a page. The more common approaches are using colour schemes that render the elements in question effectively invisible.

  • Cloaking: We all know this one right? This is when one identifies a search engine crawler and seeks to show a different version of the page to the spider than it would for the average user. This, one assumes, cuts down on the chances of being reported by users or competitors that might otherwise see the spammy page.

  • Redirection: The page is automatically redirected by the browser in the same manner so that the page gets indexed by the engine, but the user will never actually see it. Essentially acting as a proxy/doorway to game the engine, and misdirect the users.

Getting the idea? I'd never advise taking one's eye off the ball as far as looking for simple answers. If anything, that's what got a lot of SEOs into the dung heap to start with. But I digress...

Potential On Site Penguin Issues

Given that these ones are seemingly less the focus for Google, we'll just look at a few that I haven't seen mentioned a whole lot, that might actually be part of the algo.

Language: they might treat different languages at different levels. Research has shown that French, German and English tend to have higher levels of spam. There could be trust elements to the updates.

Top Level Domain; domains such as .INFO and .BIZ traditionally have higher levels of spam. This could lower a trust score for these.

Words per page: apparently the sweet spot for spammers is 750-1500 words. Could there be a classifier to look at this?

Keywords in page TITLE: a classic boosting technique and research has found spam pages contain far more keywords than non-spammy (classified) pages.

Amount of anchor text: they might look at the ratios of text to anchor text. Interestingly, that approach could be Panda related as well.

Compressibility: As a mechanism used to fight KW stuffing, search engines can also look at compression ratios. More specifically, this helps catch repetitious or spun content, which compresses far better than natural writing.

Host-level spam: looking at other domains on the server and/or registrar levels. Certainly easily found networks are on the table one would have to imagine.

Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.

Outgoing links: a website might link out to well-known pages seeking to raise its 'hub score' (see TrustRank concepts earlier). That said, for any use of this I'd imagine a low threshold to deal with false positives (think of sites that scrape entire sections of Wikipedia).
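
For the geeks, the compressibility idea from the list above is easy to play with. This is a minimal sketch in Python, assuming zlib as the compressor; the function name and the sample strings are made up for illustration, and nothing here is known to be what Google actually runs. The ratio is simply uncompressed size divided by compressed size, so repetitive text scores higher.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Return uncompressed size / compressed size for a piece of text.

    Higher values mean more repetition. Any cut-off used to flag a page
    is an assumption and would need tuning against real data.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical samples: one keyword-stuffed, one ordinary sentence.
stuffed = "cheap flights best cheap flights book cheap flights now " * 40
natural = (
    "We compared fares across three airlines for a week in March and "
    "found that booking on a Tuesday saved roughly a tenth of the price."
)

print(round(compression_ratio(stuffed), 1))
print(round(compression_ratio(natural), 1))
```

The stuffed sample should come out with a much higher ratio than the natural one. On its own a number like this proves nothing; it would only ever be one flag among many.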

And most certainly one can look around their CMS to ensure they're not cloaking, sending odd redirects, hiding text and so on. Obviously we'd never do that knowingly right? That's what I thought.

Moving along...

Potential Linking Penguin Issues

Certainly this area is the one getting the most attention after the first two rounds. And I don't mind that, it is just more about needing to think beyond popular theories (anchor text comes to mind). So let's look at some things that might be in the mix;

TrustRank: better known as neighbourhoods (more here). Good sites link to good sites, spam sites (generally) link to other spammy (feeder) sites. Trust in general, does seem to be involved in the Penguin evolution. (also see harmonic rank)

Link stuffing: while it can be used on-site, it is the concept of creating tons of pages that have a link pointing to a given target page. This may be site-wides, or multiple domains as well as on-site. In fact, to a degree the practice of low level directories for SEO could play here as could forums, link spam, widgets and infographics.

Nepotistic links: the well known usual suspects such as paid links, link exchanges and their ilk. We certainly do know that Google isn't much of a fan of this type of approach. We can surely go out on a limb and infer many of these types of link spam are in consideration.

Topological spamming (link farms): search engines will often compare % of links in the graph against known entities ('good sites'). Do you have a disproportionate number of links compared to those in your query space(s)? It may be an issue.

Temporal anomalies: better known as 'link velocity' and 'link decay' to most. Again, when looking at relative pages in the index, spam pages will generally stand out. Those manufacturing links will have a different graph than those considered 'good pages' to Google.

Anchor text spam: it's no secret that those trying to manipulate place a high value here. As with other thresholds mentioned, this can be compared to other sites in the query space(s) considered 'good sites' as part of a seed set. This one has certainly seen play since Penguin launched.

Expired domains; when the spammer buys expiring domains that have link equity, to point to the target site(s). Or simply replacing, changing the content on the domain to take advantage of the existing equity.
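
To picture the TrustRank item in the list above, here is a rough sketch in Python of trust flowing out from a seed set of 'good sites'. It is a simplified, PageRank-style propagation under my own assumptions: the domain names, the damping value and the iteration count are all hypothetical, and pages with no outgoing links simply stop passing trust on. Treat it as a toy, not as the algorithm any engine uses.

```python
def trust_rank(links, seeds, damping=0.85, iterations=50):
    """Propagate trust from seed pages along outgoing links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of pages assumed to be hand-verified 'good sites'.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    # Only seed pages receive the 'teleport' share of trust.
    seed_score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    trust = dict(seed_score)

    for _ in range(iterations):
        nxt = {p: (1.0 - damping) * seed_score[p] for p in pages}
        for page, targets in links.items():
            if not targets:
                continue  # dangling page: its trust is not passed on
            share = damping * trust[page] / len(targets)
            for target in targets:
                nxt[target] += share
        trust = nxt
    return trust


# Hypothetical link graph: a trusted neighbourhood and an isolated farm.
links = {
    "university.example": ["news.example"],
    "news.example": ["university.example", "blog.example"],
    "blog.example": ["news.example"],
    "farm-a.example": ["farm-b.example", "moneysite.example"],
    "farm-b.example": ["farm-a.example", "moneysite.example"],
    "moneysite.example": [],
}
scores = trust_rank(links, seeds={"university.example"})

for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the farm and the money site never receive any trust, because no trusted page links into that neighbourhood, however many links they trade among themselves.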

And within each of these, there are plenty of concepts they may employ. I encourage you to read at least ONE of the papers/patents listed at the end of this post. It could be enlightening ;0)

A Matter of Levels

Right then, enough of the geekery. The main goal here was to start to think beyond the everyday. Stop trying to nail down what Penguin is (or will become) with things like 'anchor text ratios' and 'networks'. If this truly is a web spam update, then there's a lot more on the table for Cutts and Co to chew on.

I like to consider much of this in terms of thresholds. It's a common theme. We might consider;

  • What is the number of flags being satisfied?
  • What are the ratios of these factors in my market/query space(s)?
  • Does the site take a major or a minor hit?
  • How do 'unnatural linking' messages (manual) differ?
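
One way to picture the first three questions above is as a count of flags, each measured against what is normal for 'good sites' in the same query space. The Python sketch below is purely illustrative: the signal names, the tolerance multiplier and the minor/major cut-offs are all assumptions of mine, not anything Google has confirmed.

```python
from statistics import median


def count_flags(site, peers, tolerance=2.0):
    """Return the signals where a site exceeds `tolerance` times the
    median of 'good' peers in the same query space.

    Signal names and the tolerance are hypothetical.
    """
    flagged = []
    for signal, value in site.items():
        baseline = median(peer[signal] for peer in peers)
        if value > tolerance * baseline:
            flagged.append(signal)
    return flagged


def severity(flagged, minor_at=1, major_at=3):
    """Map the number of flags to a hypothetical hit level."""
    if len(flagged) >= major_at:
        return "major"
    if len(flagged) >= minor_at:
        return "minor"
    return "none"


# Hypothetical numbers for three 'good' peers and one site under review.
peers = [
    {"exact_match_anchor_share": 0.05, "sitewide_link_share": 0.10, "new_links_per_week": 12},
    {"exact_match_anchor_share": 0.08, "sitewide_link_share": 0.15, "new_links_per_week": 20},
    {"exact_match_anchor_share": 0.04, "sitewide_link_share": 0.05, "new_links_per_week": 9},
]
site = {"exact_match_anchor_share": 0.60, "sitewide_link_share": 0.12, "new_links_per_week": 300}

flags = count_flags(site, peers)
print(flags, severity(flags))
```

The point of the toy is that the same raw number can be fine in one market and a flag in another, since the baseline comes from the peers rather than from a fixed rule.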

Given that we have (for now) a definitive line between Penguin and manual actions (Webmaster Tools messages), I would also imagine that Penguin itself may still have relatively lower thresholds.

Whatever does happen next, most SEOs that have been paying attention, should be fine.

Man, I kill myself.... weeeeeeeeeeeeeee


ADDED; I have also published a post on Search Metrics to help you diagnose a Google Penguin problem. Thought it worth sharing here as well.

More stuff

Seriously, I've been saying it for years... but REALLY, read a few of these. Realize just how much there is to this. It wasn't the 'anchor text' update, it's a web spam update. Aight?

Web Spam Research Papers

TrustRank Concepts

Link Spam

Social Spam

Language/Semantic related

David Harry -

Hi, my name is Dave and I am an algo-holic

I am an avid search geek who spends most of his time reading about and playing with search engines. My main passion has always been the technical side of things, from a strong perspective rooted in IR and related technologies. You can find me providing SEO consulting services for Verve Developments.

Last Updated on Monday, 27 August 2012 05:36


0 #1 Ralph du Plessis 2012-08-21 05:02
Thanks for this Dave, in particular the 'More Stuff" links.
I still can't see how this can be as aggressive as they make out other than to devalue all the obvious link building tactics like directories, sitewide and column links etc. One could argue that disproportionate anchor text is also an easy one, but how do they differentiate between those that have become synonymous with a product or which have the product (anchor text) in their name e.g. "cheap flights".
One can't help wondering whether Matt Cutts' reference to G+ signals "not requiring too much attention from SEOs yet" could be a red herring?
0 #2 Rafael Montilla 2012-08-21 05:46
This is a master post, I meant a master SEO class....

Thanks David!
0 #3 David Harry 2012-08-21 06:35
@Ralph - I am not a huge fan of the social signals and SEO concept (beyond the logged in state) as I voiced in a recent SEW article;

As for the anchors, that part is more about named entities. Google is pretty good with that stuff. It is natural to see the domain name, brand or product names in anchor texts. (more on entities here; )

At the end of the day I simply wanted to ensure folks weren't myopic on what might be in play. It pays to think outside the box and had more SEOs done so before Penguin, there might be less pain hee hee

@Raf - lol... very good. Not sure about that, likely more a post about information retrieval. It's translating that into SEO and yer everyday activities that is the art
0 #4 Michiel Van Kets 2012-08-23 04:38
well, Harry
I kind of agree with you, on the social signals and the login, but you can still use public pages in those social sites; profiles, and use those to link to/from your site and get shares etc.

while even more so; those public pages can be linked with other profiles, and by linking ... associating ...them all together you have an easy, convenient way of getting links between all of your off-page content, especially natural shares and likes, not so much to gain value, but to pass it on, to spread more, while at the same time bundling the value on the profile pages, that's where you then put your embedded anchor text link; on the profile pages - with unique text/bio/profiles for all of them of course.

there's not much to be gained from those social sites themselves, but you can use them to pass on value from other profiles to those pages you want to give an extra push, you know like a bridge or ... a link ;-)

All content and images copyright Search News Central 2014
SNC is a Verve Developments production, the Forensic SEO Specialists- where Gypsies roam.