How to block search engines from indexing all URLs beginning with origin.domainname.com
You can rewrite robots.txt to another file (let's name it 'robots_no.txt') containing:
User-Agent: *
Disallow: /
(source: http://www.robotstxt.org/robotstxt.html)
The .htaccess file would look like this:
RewriteEngine On
# Any host other than www.example.com gets the blocking file
RewriteCond %{HTTP_HOST} !^www\.example\.com$
RewriteRule ^robots\.txt$ robots_no.txt [L]
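Before deploying, it can help to confirm that the content of 'robots_no.txt' really disallows everything. The following is a minimal sketch using Python's standard urllib.robotparser; the helper name is_blocked and the URL origin.example.com are assumptions made for illustration, not part of the setup above.

```python
from urllib.robotparser import RobotFileParser

# Content of 'robots_no.txt' as given above.
ROBOTS_NO = """\
User-Agent: *
Disallow: /
"""


def is_blocked(robots_text: str, url: str, agent: str = "*") -> bool:
    """Return True if robots_text forbids `agent` from fetching `url`.

    Hypothetical helper: parses the text in memory, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return not parser.can_fetch(agent, url)


# origin.example.com is a placeholder host for this sketch.
print(is_blocked(ROBOTS_NO, "http://origin.example.com/page.html"))
```

This only checks the file's content; whether the rewrite rule actually serves that file for a given host still has to be tested against the running server.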
Use customized robots.txt for each (sub)domain:
RewriteEngine On
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^sub\.example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^example\.com$ [OR]
RewriteCond %{HTTP_HOST} ^www\.example\.org$ [OR]
RewriteCond %{HTTP_HOST} ^example\.org$
# Rewrites the above (sub)domains <domain> to robots_<domain>.txt
# example.org -> robots_example.org.txt
# Note: server variables are written %{NAME}; ${...} is RewriteMap syntax
RewriteRule ^robots\.txt$ robots_%{HTTP_HOST}.txt [L]
# in all other cases, use default 'robots.txt'
RewriteRule ^robots\.txt$ - [L]
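To make the mapping explicit, here is a rough Python sketch of what the rules above are meant to do for a request to /robots.txt. The names CUSTOM_HOSTS and robots_file_for are hypothetical, and the sketch assumes an exact, case-sensitive match on the Host header (as the anchored patterns without [NC] would give).

```python
# Hosts listed in the RewriteCond lines above.
CUSTOM_HOSTS = {
    "www.example.com",
    "sub.example.com",
    "example.com",
    "www.example.org",
    "example.org",
}


def robots_file_for(host: str) -> str:
    """Return the file expected to be served for /robots.txt on `host`.

    Hypothetical helper mirroring the rewrite rules: listed hosts get
    robots_<host>.txt, every other host gets the default robots.txt.
    """
    if host in CUSTOM_HOSTS:
        return f"robots_{host}.txt"
    return "robots.txt"


print(robots_file_for("example.org"))
print(robots_file_for("other.example.net"))
```

Each file named this way (for example robots_example.org.txt) has to exist in the document root, otherwise the rewritten request ends in a 404.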
Instead of asking search engines to block all pages on hosts other than www.example.com, you can also use <link rel="canonical">.
If http://example.com/page.html and http://example.org/~example/page.html both point to http://www.example.com/page.html, put the following tag in the <head>:
<link rel="canonical" href="http://www.example.com/page.html">
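To check that a page actually carries the expected canonical URL, something like the following sketch could be used. It relies only on Python's standard html.parser; the class CanonicalFinder and the function find_canonical are names invented for this example.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        # rel may hold several space-separated tokens
        rel = (attr.get("rel") or "").lower().split()
        if tag == "link" and "canonical" in rel and self.canonical is None:
            self.canonical = attr.get("href")


def find_canonical(html: str):
    """Return the canonical URL declared in `html`, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


page = (
    "<html><head>"
    '<link rel="canonical" href="http://www.example.com/page.html">'
    "</head><body></body></html>"
)
print(find_canonical(page))
```

Running it against the HTML served under each alias (example.com, example.org/~example) should give the same www.example.com URL.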
See also Google's article about rel="canonical".