How to block search engines from indexing all urls beginning with origin.domainname.com

You can rewrite robots.txt to an other file (let's name this 'robots_no.txt' containing:

User-Agent: *
Disallow: /

(source: http://www.robotstxt.org/robotstxt.html)

The .htaccess file would look like this:

RewriteEngine On
RewriteCond %{HTTP_HOST} !^www.example.com$
RewriteRule ^robots.txt$ robots_no.txt

Use customized robots.txt for each (sub)domain:

RewriteEngine On
RewriteCond %{HTTP_HOST} ^www.example.com$ [OR]
RewriteCond %{HTTP_HOST} ^sub.example.com$ [OR]
RewriteCond %{HTTP_HOST} ^example.com$ [OR]
RewriteCond %{HTTP_HOST} ^www.example.org$ [OR]
RewriteCond %{HTTP_HOST} ^example.org$
# Rewrites the above (sub)domains <domain> to robots_<domain>.txt
# example.org -> robots_example.org.txt
RewriteRule ^robots.txt$ robots_${HTTP_HOST}.txt [L]
# in all other cases, use default 'robots.txt'
RewriteRule ^robots.txt$ - [L]

Instead of asking search engines to block all pages on for pages other than www.example.com, you can use <link rel="canonical"> too.

If http://example.com/page.html and http://example.org/~example/page.html both point to http://www.example.com/page.html, put the next tag in the <head>:

<link rel="canonical" href="http://www.example.com/page.html">

See also Googles article about rel="canonical"

How to block search engines from indexing all urls beginning with origin.domainname.com

Tags:

Url Rewriting

.Htaccess

Robots.Txt

Related

Recent Posts