Is there a way to disallow crawling of only HTTPS in robots.txt?
There is no way to do that in robots.txt itself: a robots.txt file applies only to the protocol, host and port it was fetched from, and there is no directive for excluding a scheme such as HTTPS.
You could, however, serve a different robots file entirely for secure HTTPS connections. Here is one way of doing so using rewrite rules in your .htaccess file:
RewriteEngine On
RewriteCond %{HTTPS} =on
RewriteRule ^robots\.txt$ robots-deny-all.txt [L]
Where robots-deny-all.txt has the contents:
User-agent: *
Disallow: /
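Requests over plain HTTP do not match the condition, so they keep getting your normal robots.txt. If you have access to the main server configuration rather than .htaccess, a similar rule could live directly in the HTTPS virtual host instead; this is a minimal sketch, assuming your SSL site is configured in its own VirtualHost block (note the leading slash, since in this context the pattern is matched against the full URL path):
<VirtualHost *:443>
    # ... existing SSL configuration ...
    RewriteEngine On
    # Serve the deny-all file whenever robots.txt is requested over HTTPS
    RewriteRule ^/robots\.txt$ /robots-deny-all.txt [L]
</VirtualHost>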
Before you try to manipulate robots.txt, ensure that you have defined canonical link elements on your pages.
Web crawlers should treat:
<link rel="canonical" href="…" />
as a very strong hint that two pages should be considered to have the same content, and that one of the URLs is the preferred address for the content.
As stated in RFC 6596 Section 3:
The target (canonical) IRI MAY:
…
- Have different scheme names, such as "http" to "https"…
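For example (using hypothetical URLs), a page served over HTTPS could declare its plain-HTTP counterpart as the canonical address:
<!-- Served at https://example.com/page.html; the canonical link tells
     crawlers that the HTTP version is the preferred address. -->
<link rel="canonical" href="http://example.com/page.html" />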
With the canonical link hints in place, a reasonably intelligent web crawler should treat the HTTPS URLs as duplicates of the HTTP pages rather than indexing the site a second time over HTTPS.