How to: Scrape search engines without pissing them off

You can learn a lot about a search engine by scraping its results. It’s the only easy way you can get an hourly or daily record of exactly what Google, Bing or Yahoo! (you know, back when Yahoo! was a search engine company) show their users. It’s also the easiest way to track your keyword rankings.

SERP Scraping

Like it or not, whether you use a third-party tool or your own, if you practice SEO then you’re scraping search results.

If you follow a few simple rules, it’s a lot easier than you think.

The problem with scraping

Automated scraping, meaning grabbing search results using your own ‘bot’, violates every search engine’s terms of service. Search engines sniff out and block major scrapers.

If you ever perform a series of searches that matches the behavior of a SERP crawler, Google and Bing will interrupt your search with a captcha page. You have to solve the captcha or pass whatever test the page requires before performing another query.

That (supposedly) blocks bots and other scripts from automatically scraping lots of pages at once.

The reason? Resources. A single automated SERP scraper can perform tens, hundreds or even thousands of queries per second. The only limitations are bandwidth and processing power. Google doesn’t want to waste server cycles on a bunch of sweaty-palmed search geeks’ Python scripts. So, they block almost anything that looks like an automatic query.

Your job, if you ever did anything like this, which you wouldn’t, is to buy or create software that does not look like an automatic query. Here are a few tricks my friend told me:

 

Stay on the right side of the equation

Note that I said “almost” anything. The search engines aren’t naive. Google knows every SEO scrapes their results. So does Bing. Both engines have to decide when to block a scraper. My testing suggests that the equation behind the block/don’t block decision balances:

  • The potential server load created by the scraper.
  • The potential load created by blocking the scraper.
  • The query ‘space’ for the search phrase.

At a minimum, a SERP bot has to be capable of tying up server resources before a search engine will bother with it. If it can’t, the search engine won’t waste the CPU cycles on blocking it. It’s not worth the effort.

So, if you’re scraping the SERPs, you need to stay on the right side of that equation: Be so unobtrusive that, even if you’re detected, you’re not worth squashing.

Disclaimer

Understand, now, that everything I talk about in this article is totally hypothetical. I certainly don’t scrape Google. That would violate their terms of service.

And I’m sure companies like AuthorityLabs and SERPBuddy have worked out special agreements for hourly scraping of the major search engines. But I have this… friend… who’s been experimenting a bit, testing what’s allowed and what’s not, see…

Cough.

How I did my test

I tested all these theories with three Python scripts. Each of them does the following:

  1. Performs a Google search.
  2. Downloads the first page of results.
  3. Downloads the next 4 pages.
  4. Saves the pages for parsing.
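
The four steps above can be sketched roughly like this. It is a minimal sketch, not the scripts used in the test. The function names, the `start` offset of 10 results per page and the file naming scheme are all assumptions. The fetch function is passed in, so the example can run without touching any search engine.

```python
from pathlib import Path
from typing import Callable, List
from urllib.parse import urlencode

RESULTS_PER_PAGE = 10  # assumption: each results page holds 10 results
PAGES_PER_KEYWORD = 5  # the first page plus the next 4


def build_page_urls(keyword: str, pages: int = PAGES_PER_KEYWORD) -> List[str]:
    """Return one search URL per results page for a keyword."""
    base = "https://www.google.com/search"
    return [
        f"{base}?{urlencode({'q': keyword, 'start': page * RESULTS_PER_PAGE})}"
        for page in range(pages)
    ]


def save_pages(keyword: str, fetch: Callable[[str], str], out_dir: Path) -> List[Path]:
    """Fetch every results page for a keyword and save the raw HTML for parsing."""
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, url in enumerate(build_page_urls(keyword), start=1):
        html = fetch(url)  # hypothetical fetcher supplied by the caller
        path = out_dir / f"{keyword.replace(' ', '_')}_page{number}.html"
        path.write_text(html, encoding="utf-8")
        saved.append(path)
    return saved
```

Keeping the download and the parsing separate means a change in the results page layout only breaks the parser, not the saved data.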

Script #1 had no shame. It hit Google as fast as possible and didn’t attempt to behave like a ‘normal’ web browser.

Script #2 was a little embarrassed. It pretended to be Mozilla Firefox and only queried Google once every 30 seconds.

Script #3 was downright bashful. It selected a random user agent from a list of 10, and paused between queries for anywhere between 15 and 60 seconds.
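
One way to picture the difference between the three scripts is as three pacing profiles. The sketch below is an illustration only: the class, the constant names and the placeholder user agent strings are made up for this example, and only the delays and list sizes come from the descriptions above.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ScraperProfile:
    name: str
    user_agents: Tuple[str, ...]  # empty tuple: send the HTTP library's default
    min_delay: float  # seconds to wait between queries, lower bound
    max_delay: float  # seconds to wait between queries, upper bound

    def next_delay(self, rng: random.Random) -> float:
        """Pick the pause before the next query."""
        return rng.uniform(self.min_delay, self.max_delay)

    def next_user_agent(self, rng: random.Random) -> Optional[str]:
        """Pick a user agent for the next query, or None for the default."""
        return rng.choice(self.user_agents) if self.user_agents else None


# Placeholder strings, not real browser user agents.
FIREFOX_UA = "Mozilla/5.0 (placeholder Firefox user agent)"
TEN_UAS = tuple(f"placeholder-browser-agent-{i}" for i in range(1, 11))

SHAMELESS = ScraperProfile("script 1", (), 0.0, 0.0)                # as fast as possible
EMBARRASSED = ScraperProfile("script 2", (FIREFOX_UA,), 30.0, 30.0)  # fixed 30 seconds
BASHFUL = ScraperProfile("script 3", TEN_UAS, 15.0, 60.0)            # random 15 to 60 seconds
```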

The results

Script #3 did the best. That’s hardly a surprise. But look at the size of the difference:

  • Script #1 was blocked within 3 searches.
  • Script #2 was blocked within 10 searches.
  • Script #3 was never blocked, and performed 150 searches. That means it pulled 5 pages of ranking data for 150 different keywords.

There’s no way any of these scripts fooled Google. The search engine had to know that scripts 1, 2 and 3 were all scrapers. But it only blocked 1 and 2.

My theory: Script 3 created so small a burden that it wasn’t worth it for Google to block it. Just as important, though, was the fact that script 3 didn’t make itself obvious. Detectable? Absolutely. I didn’t rotate IP addresses or do any other serious concealment. But script 3 behaved like it was, well, embarrassed. And a little contrition goes a long way. If you acknowledge you’re scraping on the square and behave yourself, Google may cut you some slack.

The rules

Based on all of this, here are my guidelines for scraping results:

  1. Scrape slowly. Don’t pound the crap out of Google or Bing. Make your script pause for at least 20 seconds between queries.
  2. Scrape randomly. Randomize the amount of time between queries.
  3. Be a browser. Have a list of typical user agents (browsers). Choose one of those randomly for each query.

Follow all three of these and you’re a well-behaved scraper. Even if Google and Bing figure out what you’re up to, they leave you alone: You’re not a burglar. You’re scouring the gutter for loose change. And they’re OK with that.
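
Put together, the three rules might look something like the loop below. Treat it as a hedged sketch rather than a recipe: `scrape_politely`, the 60 second upper bound and the placeholder user agent strings are assumptions, and the `fetch` and `sleep` functions are injected so the loop can be tried out without sending a single real query.

```python
import random
import time
from typing import Callable, Dict, Iterable, List, Optional

MIN_DELAY = 20.0  # rule 1: pause at least 20 seconds between queries
MAX_DELAY = 60.0  # assumption: upper bound borrowed from script #3

# Placeholder strings; a real list would hold typical browser user agents.
USER_AGENTS = [f"placeholder-browser-agent-{i}" for i in range(1, 11)]


def scrape_politely(
    urls: Iterable[str],
    fetch: Callable[[str, Dict[str, str]], str],
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Fetch each URL in turn, pausing and switching user agent between queries."""
    if rng is None:
        rng = random.Random()
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Rules 1 and 2: a slow, randomized pause before every query but the first.
            sleep(rng.uniform(MIN_DELAY, MAX_DELAY))
        # Rule 3: choose a user agent at random for each query.
        headers = {"User-Agent": rng.choice(USER_AGENTS)}
        pages.append(fetch(url, headers))
    return pages
```

In real use, `fetch` could wrap something like `urllib.request.urlopen` with a `Request` that carries the chosen headers. Passing it in keeps the pacing logic separate from the HTTP library.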

15 comments

  • Great article, Ian! Scraping the search engines is a fine art. You need to be careful and use everything in your arsenal. But once you have random user agents, at least 20 seconds between queries, and a different IP address for every request, there is really no way for any search engine to distinguish your script from a normal user.

    The problem with scraping comes when you want to scrape more than one search engine, including the local versions of each one. It takes a lot of time to create the regular expressions that will scrape the result pages, and when you think everything is working just fine, the search engine changes something in the results page and you need to start all over.

    So scraping thousands of search engines is a task too large to do manually. A solution to bypass these downsides is to use an automated tool (e.g. Advanced Web Ranking).

  • That’s right. You need to go slowly to avoid being blocked. If you have a lot of IP addresses, you can go faster.

  • Ok, the key is to use different IP addresses, several user agents and time… Very good post, congratulations!

  • Wow, I’ve been looking for a working method to scrape SERPs, your tips are really helpful, thanks. 🙂

  • Sooooooo… add proxies and voila? Maybe randomise the times for #2?

    Anyway, good read and will go experiment a bit myself.

    Cheers.

  • Thank you for the tips! 8)

  • Browser useragent rotation, IP rotation … seems to be the same thinking behind CL Searcher regarding content scraping on craigslist. So that’s how they do it 8)

  • I’m wondering if you might share your Python scripts. I have a list of company names for which I’m trying to obtain the link of the top search result. I’ve set my script to execute every 3-5 minutes, but I’d like to incorporate the use of multiple proxies, so as to shorten that time between queries. Any ideas on how I could do this? I’m new to Python and think your code would be very helpful.

    Thanks so much,
    Rick

  • Could we say this is still a valid thing to pursue?

    Great article! 😆

  • Bing’s user policy seems to specify a limit of 7 searches per second.

  • I have been thinking of making my own bot to do this. I know this is an old post, but times have changed. I wonder whether you could use Selenium to grab the data for you, since it drives a real web browser. I’m currently trying to work something out. All I want to do is crawl page 1 for the keyword and pass back what position my site is in. serpcloud was useless at updating and serps is bleeding expensive.

  • Hi,
    I am not sure what AuthorityLabs, webceo, ranktrackr or other similar software is doing.

    WebPosition was banned, so why is this software still working?

  • If the script fetched every element of the page like images then it would be harder for search engines to tell if it was a bot.

  • Marketing Consultant

    I also suggest making the process semi-automatic.

    The robot can crawl, and if it detects a captcha, it can show it to an operator who solves the challenge.

  • Webmarketing Expert

    I personally wait an average of 60 to 90 seconds between queries when scraping Google results, without rotating user agent strings.

    Result: I’ve never been blocked by a 503 error 😆
