What’s the difference between the robots.txt file and a noindex meta tag? Both are used to keep search engine crawlers away from a URL on a website, and both are often assumed to mean “do not index.” But are they really the same? In many situations, adding a disallow directive to the robots.txt file will have the same practical effect as using a meta noindex tag on a web page. But there are times when you should NOT use the robots.txt file and should use the meta noindex tag instead. Let’s look at the differences.
The robots.txt file must be located at the root of your website’s domain name. For example, for this site, the robots.txt file is located here: https://searchnewscentral.com/robots.txt. The file shouldn’t be in any other location. You will, though, need a separate robots.txt file for each subdomain, if you have any. For example, if this site set up billhartzer.staging.searchnewscentral.com, that subdomain would need its own robots.txt file at billhartzer.staging.searchnewscentral.com/robots.txt.
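Since the location rule is mechanical, it can be sketched in a few lines of Python. This is only an illustration: the helper name robots_txt_url and the example page paths are made up for this sketch, and it assumes the page URL includes its scheme (https://).

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL that applies to the given page URL."""
    parts = urlsplit(page_url)
    # Keep the scheme and the full host (including any subdomain),
    # then replace the path with /robots.txt and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URLs on the hosts mentioned in the article.
print(robots_txt_url("https://searchnewscentral.com/some/article/"))
print(robots_txt_url("https://billhartzer.staging.searchnewscentral.com/page.html?x=1"))
```

Each host maps to its own file, which is why the staging subdomain needs a separate robots.txt.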
Robots.txt File Disallow
The robots.txt file generally has two purposes: to list the URLs on the website that you do NOT want the search engine crawlers to crawl, and to list the sitemap URLs of the website. For now, let’s focus on the robots.txt file and the disallow directive.
If you want to disallow (stop or not allow) the search engines from crawling a web page on your website, you simply add lines to the robots.txt file like this:
User-agent: *
Disallow: /thankyou.html
That would stop the search engines from crawling a page on your site called thankyou.html. That’s a good example, because pages like thankyou.html are presented to visitors once they fill out a form on your website. So, you generally don’t want those pages showing up in the search engines. If you track the visits to those pages in Google Analytics as goals, then it’s important to keep them out of the search results, because a visitor who lands on one directly from a search would be counted as a completed goal.
In the case above, the User-agent: * line tells all search engine bots (and other bots that crawl websites) that the directives below it apply to them. You can add additional groups to the file to tell specific bots not to crawl specific pages. So, if you wanted Google not to crawl a URL meant for Bing, then you can do that. Likewise, you can tell Bing not to crawl a URL meant for Google.
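As a rough sketch of per-bot rules, the following Python snippet parses a sample robots.txt with the standard library’s urllib.robotparser and asks which bots may fetch which URLs. The file names bing-offer.html and google-offer.html are hypothetical, invented for this example. Note one assumption built into the sample: a bot that matches a named group follows only that group, so the thankyou.html rule is repeated in each group.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt; bing-offer.html and google-offer.html are made-up names.
ROBOTS_TXT = """\
User-agent: *
Disallow: /thankyou.html

User-agent: Googlebot
Disallow: /thankyou.html
Disallow: /bing-offer.html

User-agent: Bingbot
Disallow: /thankyou.html
Disallow: /google-offer.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check each bot against each path and report the crawl permission.
for bot in ("Googlebot", "Bingbot", "SomeOtherBot"):
    for path in ("/thankyou.html", "/bing-offer.html", "/google-offer.html"):
        allowed = parser.can_fetch(bot, path)
        print(f"{bot:12} {path:20} {'crawl allowed' if allowed else 'disallowed'}")
```

Keep in mind this only answers the question “may this bot crawl this URL?” It says nothing about whether the URL ends up in the index.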
Anyhow, this is where the robots.txt file and the noindex meta tag differ. The robots.txt file tells the search engines to not crawl a particular URL (or page) on the website. Note that I said they won’t “CRAWL” that URL (or page) on the website. That also, technically, means that they can still index the URL (or page) if they know about it.
This is an important distinction.
The search engines (mainly Google) may know about a particular URL (or page) on the website. They may have crawled the page before but now you’re telling them not to crawl the URL. There may be other pages on your website (and on other websites) that are linking to that URL. So, Google knows about that URL. But, they don’t have permission (via the robots.txt file) to “CRAWL” the page.
So, what happens if Google knows about a URL yet you tell them not to crawl the URL? The URL can still appear in the search results.
Google “KNOWS” about that page. But the robots.txt file tells the search engines to NOT crawl the page. However, starting a few years ago, Google decided that they would still index pages that they aren’t allowed to crawl via the robots.txt file. In some cases, this can be an issue. For example, it’s quite possible that a web page like this could rank well in the search results because it’s indexed. However, Google will show only the URL and “No information is available for this page…Learn why” in the search results.
In the case of my example page, I don’t want that page crawled. I am not really concerned whether it’s indexed, as the page just contains a form and has no content. So, I’m not concerned that people could find it in the search results.
If you are concerned that someone would find a particular web page or URL in the search results, then do NOT use the robots.txt file to disallow the URL from being crawled.
Meta NoIndex Tag
The meta noindex tag is just what it sounds like: search engines, when they come across this URL (or page), should NOT index it. They can crawl the page (in fact, they have to crawl it in order to see the noindex tag), but they aren’t allowed to index it. As I mentioned, there are some cases where you don’t want the search engines to list the URL in their index.
The meta noindex tag goes in the <head> section of the page and looks like this:
<meta name="robots" content="noindex">
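To confirm that a page really serves a noindex directive, a small checker can parse the HTML and look for a robots meta tag whose content includes noindex. The sketch below uses Python’s built-in html.parser; the class and function names are invented for this example, and it only looks at the generic robots meta tag, not at bot-specific tags or HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (bare attributes), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        directives = [d.strip().lower() for d in attr.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def has_noindex(html: str) -> bool:
    """Return True if the HTML contains a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.noindex


page = '<html><head><meta name="robots" content="noindex"></head><body></body></html>'
print(has_noindex(page))
```

A crawler can only make this same check if it is allowed to fetch the page in the first place.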
Google’s documentation has more information about this tag and its usage.
So, it’s important to understand:
- If you disallow a URL in robots.txt, the search engines can’t crawl it. They *CAN* still index it, as they can’t see a noindex tag on the page.
- If you want a page NOT indexed in the search results, then do NOT add it to the robots.txt file. Only add the meta noindex tag on the page.
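The two rules above can be summarized as a small decision function. This is a deliberate simplification of the behavior described in this article, not a model of how any search engine actually works; the function name and the four outcome labels are made up for this sketch.

```python
def search_outcome(discovered: bool, disallowed: bool, noindex: bool) -> str:
    """Simplified outcome for a URL, following the rules in this article.

    discovered: a search engine knows the URL (for example, through links)
    disallowed: robots.txt blocks crawling of the URL
    noindex:    the page carries a meta noindex tag
    """
    if not discovered:
        return "unknown"    # nothing to crawl or index yet
    if disallowed:
        # The page is never fetched, so any noindex tag on it goes unseen.
        return "url-only"   # may be listed as a bare URL with no description
    if noindex:
        return "excluded"   # crawled, tag seen, kept out of the index
    return "indexable"


# A page that is both disallowed and noindexed can still be listed.
print(search_outcome(discovered=True, disallowed=True, noindex=True))
```

The case worth remembering is the one printed here: combining a disallow with a noindex tag defeats the noindex tag.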
There are several types of pages that I generally don’t want crawled, so I will add these to the robots.txt file right from the launch of the site:
- Admin directories (especially WordPress wp-admin directory)
- Password-protected directories
- Login pages
- Image directories – sometimes, if you have sensitive images you don’t want copied and used on other websites
- PDF directories – if users must fill out a form in order to download a PDF file on your site, then keep the search engines from crawling the directories where the PDFs are located.
In most cases, if you stop the search engines from crawling these directories or pages right from the launch of a site, then they won’t be crawled and indexed. If they’re really sensitive, though, and they are URLs you don’t want indexed or found, then “ONLY” add a noindex meta tag on the page. And don’t link to the page on the website at all. The robots.txt file is publicly readable, so if you were to add the disallow directive there, it’s quite possible that someone would “snoop around” and manually look at your robots.txt file for the URLs that you tell the search engines “NOT” to crawl.
There is, in fact, a big difference between the robots.txt file and a meta noindex tag. Each has its uses, especially given the fact that Google does “NOT” treat a robots.txt disallow as a noindex request. Technically, they will not crawl a URL if you tell them not to crawl the URL. But they can still index it!
Without crawling, how is it possible to index pages?
By knowing it through internal links
When crawling a page, the bots can follow all links to other pages that are found on that page, when allowed to crawl. But even if disallowed in the robots.txt file, an external link to that page can still allow discovery, which may enable indexing.