How to: Scrape search engines without pissing them off
Written by Ian Lurie
Wednesday, 28 September 2011 12:43
You can learn a lot about a search engine by scraping its results. It’s the only easy way you can get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings. Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results. If you follow a few simple rules, it’s a lot easier than you think.

The problem with scraping

Automated scraping (grabbing search results using your own ‘bot’) violates every search engine’s terms of service. Search engines sniff out and block major scrapers. If you ever perform a series of searches that match the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to enter the captcha or perform whatever test the page requires before performing another query. That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.

The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.

Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:
Stay on the right side of the equation

Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper. Testing shows that the equation behind the block/don’t block decision balances:
At a minimum, any SERP bot must have the potential of tying up server resources. If it doesn’t, the search engine won’t waste the CPU cycles. It’s not worth the effort required to block the bot. So, if you’re scraping the SERPs, you need to stay on the right side of that equation: Be so unobtrusive that, even if you’re detected, you’re not worth squashing.

Disclaimer

Understand, now, that everything I talked about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service. And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see… Cough.

How I did my test

I tested all these theories with three Python scripts. All of them:
Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.

Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.

Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.

The results

Script #3 did the best. That’s hardly a surprise. But the difference is:
There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.

My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.

The rules

Based on all of this, here are my guidelines for scraping results:
Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they leave you alone: You’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.
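
For the curious, the behavior described for Script #3 (a random user agent from a list of 10, plus a random 15 to 60 second pause between queries) might look roughly like the sketch below. This is not the original script, which the article doesn't show. The user agent strings are placeholders, `search.example.com` is a made-up endpoint, and the names `build_request`, `next_pause` and `run_politely` are invented for illustration. The actual fetching is left to a function you pass in, so nothing here contacts a real search engine.

```python
import random
import time
import urllib.parse
import urllib.request

# Placeholder strings, NOT the list used in the article (which isn't given).
USER_AGENTS = [f"ExampleBrowser/{n}.0 (placeholder)" for n in range(1, 11)]

# Pause bounds taken from the description of Script #3.
MIN_PAUSE_S = 15
MAX_PAUSE_S = 60


def build_request(query, rng, base_url="https://search.example.com/search"):
    """Build (but do not send) a request with a randomly chosen user agent.

    base_url is a hypothetical endpoint used only for illustration.
    """
    url = base_url + "?" + urllib.parse.urlencode({"q": query})
    headers = {"User-Agent": rng.choice(USER_AGENTS)}
    return urllib.request.Request(url, headers=headers)


def next_pause(rng):
    """Pick a pause length, in seconds, uniformly between the two bounds."""
    return rng.uniform(MIN_PAUSE_S, MAX_PAUSE_S)


def run_politely(queries, fetch, rng=None, sleep=time.sleep):
    """Call fetch(request) once per query, pausing between queries.

    fetch and sleep are injected so the loop can be exercised without
    touching the network or actually waiting.
    """
    if rng is None:
        rng = random.Random()
    results = []
    for index, query in enumerate(queries):
        if index > 0:  # pause between queries, not before the first one
            sleep(next_pause(rng))
        results.append(fetch(build_request(query, rng)))
    return results
```

To try it without hitting anything, pass a stand-in such as `fetch=lambda req: req.full_url` and `sleep=print`; the loop then just prints the pauses it would have taken. With a uniform 15 to 60 second pause, the average gap is 37.5 seconds, or roughly 96 queries an hour.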
Last Updated on Wednesday, 28 September 2011 13:03
Comments
The problem with scraping comes when you want to scrape more than one search engine, including the local versions of each one. It takes a lot of time to create the regular expressions that will scrape the result pages, and just when you think everything is working fine, the search engine changes something in the results page and you need to start all over.
So scraping thousands of search engines is a task too large to do manually. A solution to bypass these downsides is to use an automated tool (e.g. Advanced Web Ranking).
Anyway, good read and will go experiment a bit myself.
Cheers.