How to prevent a staging site from being indexed by search engines
I'm normally against exposing staging servers to the public web, but if that's the best solution for your workflow, here are a few things you can consider:
Minimal Approach
- Create a new domain for the staging server (e.g. example-stage.com)
- Add a robots.txt that blocks all crawlers:
      User-agent: *
      Disallow: /
- Verify domain in Google & Bing Webmaster Tools
The minimal approach is the bare minimum to make sure you don't shoot yourself in the foot with duplicate content everywhere. Registering a separate domain gives users a clean division between what is staging and what isn't. It is also a bit cleaner when you need to move environments around, but that's more of an operational concern. CNAMEs will work as well, but remember to register each CNAME with Google and Bing Webmaster Tools. That way you can use the domain removal tool if you need to.
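One way to add the robots.txt from the list above without putting it in the site's codebase is to let the staging web server serve it. This is only a sketch, assuming Apache 2.4 with mod_alias; the file path is hypothetical, and the file itself would contain the two robots.txt lines shown earlier.

```apache
# Goes inside the staging vhost only; production never sees this.
# /etc/apache2/staging/robots.txt is a hypothetical path holding:
#   User-agent: *
#   Disallow: /
Alias "/robots.txt" "/etc/apache2/staging/robots.txt"

# Apache 2.4 denies access outside the document root unless granted.
<Directory "/etc/apache2/staging">
    Require all granted
</Directory>
```

Because the blocking robots.txt lives in the staging server's configuration rather than in the deployed files, it can't be pushed to production by accident.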
Advised Approach
- Add authentication (HTTP or otherwise) in front of requests
- Respond with an appropriate response code if not permitted (e.g. 401 Unauthorized)
- Everything else in the Minimal Approach above
A robots.txt prevents search engines from crawling the content, but that doesn't mean they won't index the URL. If a search engine learns about a URL (for example, from a link elsewhere), it may still add it to its index. You'll sometimes see these in the search results: the title tends to be the URL, with no description. To prevent this, the search engines need to be told not to show the content or the URLs. Putting authentication in front and not responding with a 200 OK status code is a strong signal to the engines not to add these URLs to their index. In my experience, I have never seen a page that returns a 401 listed in a search engine index.
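As a rough illustration of the authentication step, here is a minimal sketch of HTTP Basic authentication, assuming Apache 2.4 with mod_auth_basic, mod_authn_file and mod_authz_user enabled. The host name reuses the example domain from above, and the password file path is hypothetical (you would create the file with the htpasswd utility).

```apache
<VirtualHost *:80>
    # Example domain from the Minimal Approach; adjust to your setup.
    ServerName example-stage.com

    # Protect every URL on the staging host.
    <Location "/">
        AuthType Basic
        AuthName "Staging"
        # Hypothetical path to a file created with htpasswd.
        AuthUserFile "/etc/apache2/staging.htpasswd"
        # Requests without valid credentials get a 401 Unauthorized.
        Require valid-user
    </Location>
</VirtualHost>
```

With this in place, a crawler that requests any staging URL receives a 401 instead of a 200 OK.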
Preferred Approach
- Put staging sites behind an IP filter such as iptables (e.g. accessible only from a given IP range)
- Add a robots meta tag or X-Robots-Tag header to each page with a value of NOINDEX, NOFOLLOW
- Everything else in the Advised Approach
Putting the staging sites behind an IP filter ensures that only your clients are able to access the site. This can be a problem if they want to access it from other computers, and it is sometimes a maintenance headache, but it's the best approach if you don't want your staging environment indexed. A word of caution: make sure that all other requests (e.g. from search engines and non-clients) are served nothing at all. They should time out and never receive a 200 OK. Serving them different content could be mistaken for cloaking, which you don't want.
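If you can't filter at the firewall, a similar restriction can be sketched at the web server level, assuming Apache 2.4 (mod_authz_core and mod_authz_host). Note that this is not quite the same thing: Apache answers other clients with a 403 Forbidden rather than letting the request time out, so a firewall rule is still needed if you want no response at all. The address range below is a documentation-only placeholder.

```apache
# Inside the staging vhost: only the listed range may access the site.
<Location "/">
    # 203.0.113.0/24 is a placeholder; use your clients' real range.
    Require ip 203.0.113.0/24
</Location>
```

If you combine this with authentication in the same section, keep in mind that several Require lines are treated as alternatives by default. Wrap them in a RequireAll block when both conditions must hold.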
Additionally, to be extra safe, I would also add a robots meta tag or X-Robots-Tag header with NOINDEX, NOFOLLOW to each page, just in case the IP filter fails because of a misconfiguration or authentication ever fails. It's rare, but it happens when people touch the configuration for other reasons. As with the robots.txt file, you can really shoot yourself in the foot with these page-level robots commands if they ever get pushed out to production. So make sure your dev / staging environments use a cleanly separated configuration. Otherwise, pushing out a NOINDEX, NOFOLLOW or a Disallow: / would be disastrous for your production site.
You can disable indexing server-wide by adding the setting below to the global Apache configuration (it requires mod_headers). The same directive can also be placed inside a vhost to disable indexing for that vhost only.
    Header set X-Robots-Tag "noindex, nofollow"
Once this is done, you can test it by checking the headers Apache returns.
    $ curl -I staging.mywebsite.com
    HTTP/1.1 302 Found
    Date: Sat, 26 Nov 2016 22:36:33 GMT
    Server: Apache/2.4.18 (Ubuntu)
    Location: /pages/
    X-Robots-Tag: noindex, nofollow
    Content-Type: text/html; charset=UTF-8
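For the vhost-only variant, a minimal sketch could look like the following, assuming Apache 2.4 with mod_headers. The host name is the one from the curl example above. I use the "always" condition here so the header is also attached to error responses such as a 401 or 403, which the plain form does not cover.

```apache
<VirtualHost *:80>
    ServerName staging.mywebsite.com

    <IfModule mod_headers.c>
        # "always" also covers non-2xx responses (e.g. 401, 403).
        Header always set X-Robots-Tag "noindex, nofollow"
    </IfModule>
</VirtualHost>
```

Keeping the directive inside the staging vhost, rather than in the global configuration, makes it less likely to end up applied to a production site on the same server.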