How to stop certain URLs from being indexed
There are 2 main ways to prevent search engines from indexing specific pages:
- A robots.txt file for your domain.
- The Meta Robots tag on each page.
robots.txt should be your first stop for URL patterns that match several files (Google documents the syntax in its robots.txt reference). The robots.txt file must be placed in the root folder of your domain, i.e. at http://www.yourdomain.com/robots.txt, and it would contain something like:
User-agent: *
Disallow: /path/with-trailing-slash/
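If the URLs you want to block share a pattern rather than a common path prefix, Google (and Bing) also document support for * and $ wildcards in Disallow rules. A sketch with made-up paths, not taken from the original answer:
User-agent: *
Disallow: /private/
Disallow: /*.pdf$
The first rule blocks everything under /private/, and the second blocks any URL ending in .pdf.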
The Meta Robots tag is more flexible and capable, but must be inserted in every page you want to affect.
Again, Google has an overview of how to use Meta Robots, and of how to get pages removed from its index via Webmaster Tools. Wikipedia has more comprehensive documentation on Meta Robots, including the search-engine-specific variants.
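As a small illustration of those crawler-specific variants (this example is not from the original answer), a meta tag named after a particular bot applies only to that crawler, while other bots keep honoring the generic robots tag:
<meta name="googlebot" content="noindex">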
If you want to prohibit Google, The Web Archive and other search engines from keeping a copy of your webpage, then you want the following tag (shown in HTML4 format):
<meta name="robots" content="noarchive">
To prevent indexing and keeping a copy:
<meta name="robots" content="noindex, noarchive">
And to prevent both of the above, as well as using links on the page to find more pages to index:
<meta name="robots" content="noindex, nofollow, noarchive">
NB 1: All three of the above meta tags are for search engines alone -- they do not affect HTTP proxies or browsers.
NB 2: If you already have pages indexed and archived, and you block those pages via robots.txt while at the same time adding the meta tag to them, then robots.txt will prevent search engines from ever seeing the updated meta tag, so the pages can remain in the index.
There's actually a third way to prevent Google and other search engines from indexing URLs: the X-Robots-Tag HTTP response header. This is better than meta tags because it works for all document types, and you can send more than one tag.
The REP META tags give you useful control over how each webpage on your site is indexed. But they only work for HTML pages. How can you control access to other types of documents, such as Adobe PDF files, video and audio files and other types? Well, now the same flexibility for specifying per-URL tags is available for all other file types.
We've extended our support for META tags so they can now be associated with any file. Simply add any supported META tag to a new X-Robots-Tag directive in the HTTP header used to serve the file. Here are some illustrative examples:
Don't display a cache link or snippet for this item in the Google search results:
X-Robots-Tag: noarchive, nosnippet
Don't include this document in the Google search results:
X-Robots-Tag: noindex
Tell us that a document will be unavailable after 7th July 2007, 4:30pm GMT:
X-Robots-Tag: unavailable_after: 7 Jul 2007 16:30:00 GMT
You can combine multiple directives in the same document. For example, to not show a cached link for this document and remove it from the index after 23rd July 2007, 3pm PST:
X-Robots-Tag: noarchive
X-Robots-Tag: unavailable_after: 23 Jul 2007 15:00:00 PST
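To actually send the X-Robots-Tag header, you configure it at the web-server level. As a minimal sketch for Apache (assuming mod_headers is enabled; the .pdf pattern is just an illustration), something like this in your .htaccess or virtual-host config keeps all PDFs out of the index and out of the cache:
<FilesMatch "\.pdf$">
    # Send the header on every matching response
    Header set X-Robots-Tag "noindex, noarchive"
</FilesMatch>
On nginx the rough equivalent is an add_header X-Robots-Tag "noindex, noarchive"; line in the matching location block, and you can confirm the header is actually being sent with curl -I against one of the URLs.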
If your goal is for these pages not to be seen by the public at all, it's best to put a password on them, and/or restrict access to specific, whitelisted IP addresses (this can be done at the server level, likely via your host or server admin).
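As an illustrative sketch only (Apache 2.4 syntax, assuming .htaccess overrides are allowed; the htpasswd path and IP range are placeholders, not part of the original answer), combining a password with an IP allowlist can look like this:
AuthType Basic
AuthName "Private area"
AuthUserFile /path/to/.htpasswd
# Let the request through if the visitor either logs in or comes from the allowlisted network
<RequireAny>
    Require valid-user
    Require ip 203.0.113.0/24
</RequireAny>
Anything that fails both checks gets a 401 instead of the page, so there is nothing for a search engine to index in the first place.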
If your goal is for these pages to exist but simply not be indexed by Google or other search engines, then, as others have mentioned, you do have a few options. But I think it's important to distinguish between the two main functions of Google Search in this sense: crawling and indexing.
Crawling vs. Indexing
Google crawls your site, and Google indexes your site. Crawling is how Googlebot finds the pages of your site; indexing is how Google organizes and stores those pages for search results. Google's own documentation covers this distinction in more detail.
This distinction is important when trying to block or remove pages from Google's "index". Many people default to just blocking via robots.txt, which is a directive telling Google what (or what not) to crawl. It's often assumed that if Google doesn't crawl a page, it won't index it. However, it's extremely common to see pages that are blocked by robots.txt indexed in Google.
Directives to Google & Search Engines
These types of "directives" are merely recommendations to Google on which parts of your site to crawl and index; search engines are not required to follow them. This is important to know. Over the years I've seen many devs assume they can just block a site via robots.txt, only to find it indexed in Google a few weeks later. If someone else links to the site, or if one of Google's crawlers somehow gets hold of it, it can still be indexed.
Recently, with the updated Google Search Console (GSC) dashboard, there is a report called the "Index Coverage Report." It gives webmasters data that wasn't directly available before: specific details on how Google handles a given set of pages. I've seen and heard of many websites receiving warnings labeled "Indexed, though blocked by robots.txt."
Google's latest documentation says that if you want pages out of the index, you should add noindex, nofollow tags to them.
Remove URLs Tool
Just to build on what some others have mentioned about the "Remove URLs Tool"...
If the pages are already indexed and it's urgent to get them out, Google's "Remove URLs Tool" will let you "temporarily" block pages from search results. The request lasts 90 days, but I've used it to get pages removed from Google more quickly than with noindex, nofollow alone -- think of it as an extra layer.
Using the "Remove URLs Tool," Google still will crawl the page, and possibly cache it, but while you're using this feature, you can add the noindex nofollow tags, so it sees them, and by the time the 90 days are up, it'll hopefully know not to index your page anymore.
IMPORTANT: Using both robots.txt and noindex, nofollow tags sends somewhat conflicting signals to Google.
The reason is that if you tell Google not to crawl a page, and that same page carries noindex, nofollow, Google may never crawl it to see the noindex, nofollow tag. The page can then be indexed through some other route (a link from another site, for instance). The details of why this happens are rather vague, but I've seen it happen.
In short, in my opinion, the best way to stop specific URLs from being indexed is to add a noindex, nofollow tag to those pages. With that, make sure you're not also blocking those URLs with robots.txt, as that could prevent Google from properly seeing those tags. You can use the Remove URLs Tool to temporarily hide them from search results while Google processes your noindex, nofollow.