How to properly (dis)allow the archive.org bot? Did things change, if so when?
Update: As @KevinFegan notes in the comments, their documentation changed. The below part describes how the Internet Archive handled it in the past (at least in 2014).
Their FAQ How can I have my site's pages excluded from the Wayback Machine? refers to Removing Documents From the Wayback Machine, which documents that their bot is called ia_archiver
.
So this record should allow their bot to crawl your entire site:
User-agent: ia_archiver
Disallow:
There are really 2 issues here:
- Will the
robots.txt
on your site Disallow (block) Wayback from crawling your site. - Will Wayback crawl your site.
For point #1:
As others have said, the correct entry for robots.txt is:
User-agent: ia_archiver
Disallow:
Keep in mind that it might take a while (perhaps a good long while), for Wayback to notice any changes you have made to robots.txt.
To check if the robots.txt
on your site will allow Wayback to crawl your site:
- Go to this URL: https://archive.org/web/
- In the box at the TOP of the page, enter the URL of a page on your site, and click the
"Browse History"
button. - Or, in the box under "Save Page Now" (currently near the bottom on the right), and enter the URL of a page on your site, and click the
"Save Page"
button.
At this point, you should see 1 of 3 things:
- You will see an error message indicating that Wayback can't access pages on that site due to "robots.txt".
- You will see the "calendar" of historical save points for the page on your site. In this case, you know that Wayback is NOT blocked from crawling your site.
- Or, you will see a message indicating that Wayback doesn't have an archive of that page, and an offer to click a link to add the page to Wayback. In this case also, you know that Wayback is NOT blocked from crawling your site.
Now, for point #2:
Will Wayback crawl your site?
Just because you Allow Wayback to crawl your site, doesn't mean that they (ever) will crawl your site.
According to the Wayback FAQ (emphasis added):
How can I get my site included in the Wayback Machine?
Much of our archived web data comes from our own crawls or from Alexa Internet's crawls. Neither organization has a "crawl my site now!" submission process. Internet Archive's crawls tend to find sites that are well linked from other sites. The best way to ensure that we find your web site is to make sure it is included in online directories and that similar/related sites link to you.
Alexa Internet uses its own methods to discover sites to crawl. It may be helpful to install the free Alexa toolbar and visit the site you want crawled to make sure they know about it.
Regardless of who is crawling the site, you should ensure that your site's 'robots.txt' rules and in-page META robots directives do not tell crawlers to avoid your site.
Update: 09-May-2017
Others have left comments/answers indicating that Archive.org no longer honors robots.txt. Perhaps this is a "work-in-progress" and it will eventually be the case, but I have not seen this new behavior yet.
The case for this seems to come from this article: Robots.txt: ROBOTS.TXT IS A SUICIDE NOTE by archiveteam.org
. While that page has little if anything good to say about "Robots.txt", it doesn't mention anywhere that Archive.org will no longer honor robots.txt.
Also of note: that article is hosted on archiveteam.org
, which is most definitely not archive.org
, and I'm not sure there is any (official) relationship between archive.org
and archiveteam.org
.
In fact, this page on archive.org
about Archive Team, seems to declare a distinction between archive.org
and archiveteam.org
(emphasis added):
Formed in 2009, the Archive Team (not to be confused with the archive.org Archive-It Team) is a rogue archivist collective dedicated to saving copies of rapidly dying or deleted websites for the sake of history and digital heritage. ...
In any case, I decided to give this a try, and I found that, at least at this time, Archive.org STILL honors robots.txt:
- I found a random item on eBay: Item #: 131795294232
- Click to view the sold items:
- The "Items sold" page opens: http://offer.ebay.com/ws/eBayISAPI.dll?ViewBidsLogin&item=131795294232 Copy the link to the clipboard.
- Goto web.archive.org, and paste the link from eBay.
- You will see that
archive.org
indicates that the "Page cannot be displayed due to robots.txt."
So, at this time, I remain unconvinced, but I would love to be proven wrong... it would be great if it were true.
Update 2017
Archive bot now does not care about your robots.txt.
If you really want to block it, send them a email according to this page, or block their IP address via htaccess.