How to: Scrape search engines without pissing them off
Written by Ian Lurie
Wednesday, 28 September 2011 12:43

You can learn a lot about a search engine by scraping its results. It’s the only easy way you can get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings.

SERP Scraping

Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results.

If you follow a few simple rules, it’s a lot easier than you think.

The problem with scraping

Automated scraping — grabbing search results using your own ‘bot’ — violates every search engine’s terms of service. Search engines sniff out and block major scrapers.

If you ever perform a series of searches that match the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to enter the captcha or perform whatever test the page requires before performing another query.

That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.

The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.

Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:

 

Stay on the right side of the equation

Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper. Testing suggests the equation behind the block/don’t-block decision balances:

  • Potential server load created by the scraper.
  • The potential load created by blocking the scraper.
  • The query ‘space’ for the search phrase.

At a minimum, a SERP bot has to have the potential to tie up server resources before a search engine will act. If it doesn’t, the search engine won’t waste the CPU cycles. It’s not worth the effort required to block the bot.

So, if you’re scraping the SERPs, you need to stay on the right side of that equation: Be so unobtrusive that, even if you’re detected, you’re not worth squashing.

Disclaimer

Understand, now, that everything I talked about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service.

And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see…

Cough.

How I did my test

I tested all these theories with three Python scripts. All of them:

  1. Perform a Google search.
  2. Download the first page of results.
  3. Download the next four pages.
  4. Save the pages for parsing.

Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.

Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.

Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.
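
For the curious, here's a minimal sketch in Python of roughly what script #3 did. It isn't the original code: the user-agent strings, the URL parameters and the fetch_serp_pages helper are all illustrative, and it assumes nothing beyond the standard library.

```python
# Hypothetical sketch of "script #3": rotate user agents and pause randomly
# between queries. The user agents, URL parameters and file names are
# illustrative placeholders, not the original script.
import random
import time
import urllib.parse
import urllib.request

# A short stand-in for the list of 10 user agents the script chose from.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/534.48 Version/5.1 Safari/534.48",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 Chrome/14.0.835.202 Safari/535.1",
]

def fetch_serp_pages(keyword, pages=5):
    """Download the first `pages` result pages for one keyword and return the raw HTML."""
    results = []
    for page in range(pages):
        params = urllib.parse.urlencode({"q": keyword, "start": page * 10})
        request = urllib.request.Request(
            "https://www.google.com/search?" + params,
            headers={"User-Agent": random.choice(USER_AGENTS)},  # pick a browser identity per query
        )
        with urllib.request.urlopen(request) as response:
            results.append(response.read().decode("utf-8", errors="replace"))
        time.sleep(random.uniform(15, 60))  # the bashful pause between queries
    return results

# Save each page to disk for parsing later.
for i, html in enumerate(fetch_serp_pages("internet marketing")):
    with open("serp_page_%d.html" % i, "w", encoding="utf-8") as f:
        f.write(html)
```

None of this fools anyone, of course; it just keeps the load, and the profile, low.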

The results

Script #3 did the best. That’s hardly a surprise. But the difference was striking:

  • Script #1 was blocked within 3 searches.
  • Script #2 was blocked within 10 searches.
  • Script #3 was never blocked, and performed 150 searches. That means it pulled 5 pages of ranking data for 150 different keywords.

There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.

My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.

The rules

Based on all of this, here are my guidelines for scraping results:

  1. Scrape slowly. Don’t pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
  2. Scrape randomly. Randomize the amount of time between queries.
  3. Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.

Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they’ll leave you alone: you’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.
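
And if my friend ever wanted to bake those three rules into a script, the helpers might look something like this sketch. The user-agent strings and keywords are placeholders, and the actual query call is left out; the point is the pacing.

```python
# Hypothetical helpers for the three rules: a 20-second floor between queries,
# a randomized extra wait, and a browser user agent chosen per request.
import random
import time

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.1 Chrome/14.0.835.202 Safari/535.1",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7) AppleWebKit/534.48 Version/5.1 Safari/534.48",
]

def polite_pause(minimum=20, maximum=60):
    """Rules 1 and 2: wait at least `minimum` seconds, randomized up to `maximum`."""
    time.sleep(random.uniform(minimum, maximum))

def browser_headers():
    """Rule 3: present a typical browser user agent, picked at random for each query."""
    return {"User-Agent": random.choice(USER_AGENTS)}

# Usage: pace a batch of keyword queries, pausing between every one.
for keyword in ["seo training", "internet marketing"]:
    headers = browser_headers()
    # ...issue the query with `headers` here, e.g. via urllib as in the earlier sketch...
    polite_pause()
```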

About the author

Ian Lurie is Chief Marketing Curmudgeon and President at Portent, an internet marketing company he started in 1995. Portent is a full-service internet marketing company whose services include SEO, SEM and strategic consulting. He started practicing SEO in 1997 and has been addicted ever since. Ian rants and raves, with a little teaching mixed in, on his internet marketing blog, Conversation Marketing. He recently co-published the Web Marketing for Dummies All In One Desk Reference. In it, he wrote the sections on SEO, blogging, social media and web analytics.



Last Updated on Wednesday, 28 September 2011 13:03
 

Comments  

 
+1 #1 Philip Petrescu 2011-09-30 12:08
Great article, Ian! Scraping the search engines is a fine art. You need to be careful and use everything in your arsenal. Once you have random user agents, at least 20 seconds between queries and requests sent from a different IP address every time, there is really no way for a search engine to distinguish your script from a normal user.

The problem with scraping is when you want to scrape more than one search engine, including local versions of that search engine. It takes a lot of time to create the regular expressions that scrape the result pages, and just when you think everything is working fine, the search engine changes something in the results page and you have to start all over.

Scraping thousands of search engines is too large a task to do manually. One way around these downsides is to use an automated tool (e.g. Advanced Web Ranking).
 
 
0 #2 Carlos Fabuel 2011-10-12 22:54
That's right. You need to go slowly to avoid being blocked. If you have a lot of IPs you can go faster.
 
 
0 #3 Tomás dt 2011-11-10 18:19
OK, the key is to use different IP addresses, several user agents and time... Very good post, congratulations!
 
 
0 #4 Susan 2012-04-23 13:30
Wow, I've been looking for a working method to scrape SERPs, your tips are really helpful, thanks. :-)
 
 
0 #5 steely dan horse 2012-09-03 23:59
Sooooooo... add proxies and voila? Maybe randomise the times for #2?

Anyway, good read and will go experiment a bit myself.

Cheers.
 
 
0 #6 Karl 2013-03-05 01:00
Thank you for the tips! 8)
 
 
0 #7 Gvanto 2013-10-16 00:37
Thank you, great article - it makes sense to use these techniques.

Just curious how sites like seomoz, seocentro.com, etc. can do this en masse without getting done for TOS violations?
 
 
0 #8 Larry 2013-11-14 11:04
You can also pay Google to use their API and make it 100% legal. As far as I remember it was not expensive (something like $1 for 1,000 queries).
 
 
0 #9 Benabee 2013-12-20 15:41
Wow, this is great information. I'm writing a little scraper. I'd never have thought that rotating user agents could have so much influence. The pause interval makes sense, but I'd never thought about user agents.
 
 
0 #10 John MN 2013-12-30 09:36
Thanks for the helpful tips! It really makes sense.
 
 
0 #11 Pejuang Online 2014-01-18 14:47
I've been trying this but can't see the results. What are some good tools, software or plugins to scrape Google search results into .txt or .csv?
 
 
0 #12 Emiliano Velasco 2014-04-19 20:28
Thanks for sharing!
 
