Are you ready for the next Penguin assault?
According to comments being reported from SES, Matt Cutts has said the next incarnation is "the father of all Penguins" – "jarring and jolting" and that SEOs won’t “want the next Penguin update.” Strange, considering many of them I know didn’t want the first one. Anyway…
Accordingly, many search types are scurrying around talking about it in various communities and on social sites. Even my Skype fired up as folks started asking for my take. Not sure I really have one beyond the same mantra I’ve preached for years on this kinda crap (link spam etc).
Wait. Wasn’t this the ‘web spam’ update?
Back when we first got to know this destructive little flightless foul, it was called ‘the web spam’ update. That’s actually kind of important. This is NOT part and parcel of Panda other than sharing the same project name (Search Quality). As for what counts as web spam, here is one definition:
“(…) any deliberate human action that is meant to trigger an unjustifiably favorable relevance or importance for some web page, considering the page’s true value.” (from Web Spam Taxonomy, Stanford – PDF)
They went on to note,
"Most SEOs claim that spamming is only increasing relevance for queries not related to the topic(s) of the page. At the same time, many SEOs endorse and practice techniques that have an impact on importance scores to achieve what they call "ethical" web page positioning or optimization. Please note that according to our definition, all types of actions intended to boost ranking, without improving the true value of a page, are considered spamming." (emphasis mine)
As such, if we truly want to get a sense of what has been happening or what may be on the horizon, we need to get a bit more of an understanding of various tactics search engines use to combat SEOs… erm… I mean web spam.
Boosting and hiding
In simplest terms, we can break it down into two camps: boosting techniques and hiding techniques (from my guide to webspam).
Boosting is just like it sounds, and some examples are;
Term Spamming: This would be those seeking to manipulate through elements such as the page TITLE (title spam), Meta Description or Meta Keywords (meta spam). As most of us know, two out of three of those were abused to the point where most modern search engines don’t use them as signals at all.
URL Spamming is another area they’ve been known to also look at. Yup, strange as it sounds, because there is some weight given to URLs by some search engines, it can be considered to be a manipulation.
Link Spamming is another well-known one that also includes anchor text spamming. Search engines consider not only the mass of link spam, but also the anchor text, as this is one of the more important signals from a ranking perspective. This section obviously also includes when spammers seek to drop links on pages to increase a target page’s value (forums, comments, guest books, etc.) and obviously the more nefarious hack and drop techniques.
Hiding is more about tricking the engines with various on-site tactics. Some of these include;
Content hiding: These are techniques where terms and links are hidden when the browser renders a page. The more common approaches are using colour schemes that render the elements in question effectively invisible.
Cloaking: We all know this one, right? This is when one identifies a search engine crawler and seeks to show a different version of the page to the spider than it would for the average user. This, one assumes, cuts down on the chances of being reported by users or competitors that might otherwise see the spammy page.
Redirection: The browser is automatically redirected away from the page, so that the page gets indexed by the engine but the user will never actually see it. Essentially, it acts as a proxy/doorway to game the engine and misdirect the users.
Getting the idea? I’d never advise taking one’s eye off the ball as far as looking for simple answers. If anything, that’s what got a lot of SEOs into the dung heap to start with. But I digress…
Potential On Site Penguin Issues
Given that these ones are seemingly less the focus for Google, we’ll just look at a few that I haven’t seen mentioned a whole lot, that might actually be part of the algo.
Language: they might treat different languages at different levels. Research has shown that French, German and English tend to have higher levels of spam. There could be trust elements to the updates.
Top Level Domain: domains such as .INFO and .BIZ traditionally have higher levels of spam. This could lower a trust score for these.
Words per page: apparently the sweet spot for spammers is 750-1500 words. Could there be a classifier to look at this?
Keywords in page TITLE: a classic boosting technique and research has found spam pages contain far more keywords than non-spammy (classified) pages.
Amount of anchor text: they might look at the ratios of text to anchor text. Interestingly, that approach could be Panda related as well.
Compressibility: As a mechanism used to fight KW stuffing, search engines can also look at compression ratios. Or more specifically, at repetitious or spun content, which compresses far better than natural writing.
Host-level spam: looking at other domains on the server and/or registrar levels. Certainly, easily found networks are on the table, one would have to imagine.
Phrase-based: With this approach, a probabilistic learning model using training documents looks for textual anomalies in the form of related phrases. This is like KW stuffing on steroids. Looking for statistical anomalies can often highlight spammy documents.
Outgoing links: a website might link out to well-known pages seeking to raise their ‘hub score’ (see TrustRank concepts earlier). Although with any use of this, I’d imagine there is a low threshold to deal with false positives (think of sites that scrape entire sections of Wikipedia).
And most certainly one can look around their CMS to ensure they’re not cloaking, sending odd redirects, hiding text and so on. Obviously we’d never do that knowingly, right? That’s what I thought.
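To make the compressibility idea a bit more concrete, here is a minimal sketch. It assumes a plain zlib compression ratio as the signal; the function name and the sample texts are made up for illustration, and nothing here is confirmed to be what Google actually does.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Uncompressed size divided by zlib-compressed size.

    Higher values mean the text is more repetitive. This is an
    illustrative signal only, not a known search engine formula.
    """
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


# Hypothetical sample pages: one keyword-stuffed, one written naturally.
stuffed = "cheap blue widgets buy cheap blue widgets online " * 40
natural = (
    "Our workshop has repaired vintage bicycles since 1998. "
    "We stock hand-built wheels, leather saddles and steel frames, "
    "and every repair is test ridden before it goes back to its owner. "
    "Drop by on Saturdays for a free tyre check and a cup of coffee."
)

print(round(compression_ratio(stuffed), 1))
print(round(compression_ratio(natural), 1))
```

The stuffed page compresses far more than the natural one. Where a classifier would draw the line is anyone's guess, and a sane one would combine this with other signals rather than use it alone.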
Potential Linking Penguin Issues
Certainly this area is the one getting the most attention after the first two rounds. And I don’t mind that, it is just more about needing to think beyond popular theories (anchor text comes to mind). So let’s look at some things that might be in the mix;
TrustRank: better known as neighbourhoods (more here). Good sites link to good sites, spam sites (generally) link to other spammy (feeder) sites. Trust, in general, does seem to be involved in the Penguin evolution. (also see harmonic rank)
Link stuffing: while it can be used on-site, it is the concept of creating tons of pages that have a link pointing to a given target page. This may be site-wides, or multiple domains as well as on-site. In fact, to a degree the practice of low level directories for SEO could play here as could forums, link spam, widgets and infographics.
Nepotistic links: the well known usual suspects such as paid links, link exchanges and their ilk. We certainly do know that Google isn’t much of a fan of this type of approach. We can surely go out on a limb and infer many of these types of link spam are in consideration.
Topological spamming (link farms): search engines will often compare % of links in the graph against known entities (‘good sites’). Do you have a disproportionate number of links compared to those in your query space(s)? It may be an issue.
Temporal anomalies: better known as ‘link velocity’ and ‘link decay’ to most. Again, when looking at relative pages in the index, spam pages will generally stand out. Those manufacturing links will have a different graph than those considered ‘good pages’ to Google.
Anchor text spam: it’s no secret that those trying to manipulate rankings place a high value here. As with other thresholds mentioned, this can be compared to other sites in the query space(s) considered ‘good sites’ as part of a seed set. This one has certainly seen play since Penguin launched.
Expired domains: when the spammer buys expiring domains that have link equity and points them at the target site(s), or simply replaces the content on the domain to take advantage of the existing equity.
And within each of these, there are plenty of concepts they may employ. I encourage you to read at least ONE of the papers/patents listed at the end of this post. It could be enlightening ;0)
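As a rough illustration of comparing a site against ‘good sites’ in the same query space, here is a sketch for the anchor text case. Everything in it is an assumption for the sake of example: the function names, the `tolerance` multiplier and the sample anchors are hypothetical, and real systems would use far richer features.

```python
from collections import Counter


def top_anchor_share(anchors):
    """Share of inbound links that use the single most common anchor text."""
    if not anchors:
        return 0.0
    counts = Counter(a.strip().lower() for a in anchors)
    return counts.most_common(1)[0][1] / len(anchors)


def exceeds_baseline(site_anchors, seed_sites, tolerance=1.5):
    """True if the site's top-anchor share is more than `tolerance` times
    the average share across the seed set (hypothetical threshold)."""
    baseline = sum(top_anchor_share(a) for a in seed_sites) / len(seed_sites)
    return top_anchor_share(site_anchors) > tolerance * baseline


# Hypothetical seed set of 'good sites' in the same query space.
seed_sites = [
    ["Acme", "acme.com", "Acme Widgets", "click here", "blue widgets", "Acme"],
    ["Bolt Co", "boltco.com", "Bolt Co", "widgets", "here", "this site",
     "Bolt Co", "homepage"],
]

suspect = ["cheap blue widgets"] * 9 + ["Example"]
ordinary = ["Example", "example.com", "Example", "widgets", "here"]

print(exceeds_baseline(suspect, seed_sites))
print(exceeds_baseline(ordinary, seed_sites))
```

The point is the comparison, not the number. The same shape works for link velocity or site-wide link share: measure the site, measure the seed set, and look at how far apart they are.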
A Matter of Levels
Right then, enough of the geekery. The main goal here was to start to think beyond the everyday. Stop trying to nail down what Penguin is (or will become) with things like ‘anchor text ratios’ and ‘networks’. If this truly is a web spam update, then there’s a lot more on the table for Cutts and Co to chew on.
I like to consider much of this in terms of thresholds. It’s a common theme. We might consider;
- What is the number of flags being satisfied?
- What are the ratios of these factors in my market/query space(s)?
- Does the site take a major or a minor hit?
- How do ‘unnatural linking’ messages (manual) differ?
Given that we have (for now) a definitive line between Penguin and manual actions (Webmaster Tools messages), I would also imagine that Penguin itself may still have relatively lower thresholds.
Whatever does happen next, most SEOs that have been paying attention should be fine.
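The threshold questions above can be sketched as a simple flag counter. This is purely a thought experiment: the signal names, the `multiplier` and the `minor_at` / `major_at` cut-offs are all invented for illustration, and none of it reflects known Penguin internals.

```python
from statistics import median


def count_flags(site, peers, multiplier=2.0):
    """Return the signals where the site exceeds `multiplier` times the
    median of its peers in the same query space (hypothetical rule)."""
    flags = []
    for signal, value in site.items():
        peer_values = [p[signal] for p in peers if signal in p]
        if not peer_values:
            continue  # no baseline for this signal, so no flag
        if value > multiplier * median(peer_values):
            flags.append(signal)
    return flags


def severity(flags, minor_at=1, major_at=3):
    """Map the number of flags to a hypothetical hit level."""
    n = len(flags)
    if n >= major_at:
        return "major"
    if n >= minor_at:
        return "minor"
    return "none"


# Made-up numbers for sites ranking in the same query space.
peers = [
    {"top_anchor_share": 0.30, "links_per_month": 40, "sitewide_link_share": 0.10},
    {"top_anchor_share": 0.35, "links_per_month": 55, "sitewide_link_share": 0.15},
    {"top_anchor_share": 0.25, "links_per_month": 30, "sitewide_link_share": 0.05},
]
site = {"top_anchor_share": 0.90, "links_per_month": 400, "sitewide_link_share": 0.12}

flags = count_flags(site, peers)
print(flags, severity(flags))
```

Under this framing, a site tripping one or two flags takes a smaller hit than one tripping most of them, which would fit with some sites seeing a minor drop while others fall off a cliff.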
Man, I kill myself…. weeeeeeeeeeeeeee
ADDED; I have also published a post on Search Metrics to help you diagnose a Google Penguin problem. Thought it worth sharing here as well.
Seriously, I’ve been saying it for years… but REALLY, read a few of these. Realize just how much there is to this. It wasn’t the ‘anchor text’ update, it’s a web spam update. Aight?
Web Spam Research Papers
- Spam Double-Funnel: Connecting Web Spammers with Advertisers – the Search Ranger system
- Detecting Spam Web Pages through Content Analysis – Microsoft
- Improving web spam classification using rank-time features – (AIRWeb 2007)
- Adversarial Information Retrieval on the Web – (AIRWeb 2007)
- Web Spam Detection Using Decision Trees – Indian Institute of Information Technology
- Web Spam Detection: link-based and content-based techniques – Yahoo
- Web spam Identification Through Content and Hyperlinks – Yahoo
- Combating Web Spam with TrustRank – Stanford 2004
- Propagating Trust and Distrust to Demote Web Spam – Lehigh University
- Recognizing Nepotistic Links on the Web – B.Davison
- Detecting nepotistic links by language model disagreement
- Link Spam Alliances – Stanford
- Know your Neighbors: Web Spam Detection using the Web Topology – Yahoo
- Identifying excessively reciprocal links among web entities – Yahoo (patent)
- Link Based Small Sample Learning for Web Spam Detection – Chinese Academy of Sciences
- Undue influence: eliminating the impact of link plagiarism on web search rankings – B Wu, BD
- Detecting link spam using temporal information – Microsoft
- Extracting link spam using biased random walks from spam seed sets – B Wu, K Chellapilla
- Link Analysis for Web Spam Detection – Yahoo Research
- Link Spam Detection Based on Mass Estimation – Stanford
- Link Based Characterization and Detection of Web Spam – Yahoo
- The Anti-Social Tagger – Detecting Spam in Social Bookmarking Systems – AIRWeb
- An Empirical Study on Selective Sampling in Active Learning for Splog Detection – AIRWeb
- Identifying Video Spammers in Online Social Networks – Polytechnic University
- Social Spam Detection – Indiana University
- Web spam identification through language model analysis – AIRWeb
- Exploring Linguistic Features for Web Spam Detection: A Preliminary Study – Various authors