What's the proper way to handle Allow and Disallow in robots.txt?
One very important note: the Allow statement should come before the Disallow statement, no matter how specific your statements are.
So in your third example - no, the bots won't crawl /norobots/index.html.
Generally, as a personal rule, I put the Allow statements first and then list the disallowed pages and folders.
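To make the ordering point concrete, here is a minimal sketch using Python's urllib.robotparser, which follows the original spec's first-match behaviour; the /private/ paths are purely hypothetical and not from the question.

```python
from urllib.robotparser import RobotFileParser

def allowed(rules, path):
    """Parse a list of robots.txt lines and test whether one path may be fetched."""
    rp = RobotFileParser()
    rp.parse(rules)
    return rp.can_fetch("*", path)

# Hypothetical rules: block a directory but keep one page inside it crawlable.
allow_first = ["User-agent: *",
               "Allow: /private/public-page.html",
               "Disallow: /private/"]

disallow_first = ["User-agent: *",
                  "Disallow: /private/",
                  "Allow: /private/public-page.html"]

# With Allow listed first, the page stays crawlable.
print(allowed(allow_first, "/private/public-page.html"))     # True
# With Disallow listed first, the broader rule matches first and wins.
print(allowed(disallow_first, "/private/public-page.html"))  # False
```

Crawlers that rank rules by path specificity rather than order should reach the same answer either way here, so leading with Allow is the safer habit for both styles of parser.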
Here's my take on what I see in those three examples.
Example 1
I would ignore the entire /folder1/ directory except for the myfile.html file. Since they explicitly allow it, I would assume it was simply easier to block the entire directory and explicitly allow that one file than to list every file they wanted blocked. If that directory contained a lot of files and subdirectories, that robots.txt file could get unwieldy fast.
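The question's Example 1 isn't reproduced here, but assuming it boils down to an Allow for that one file followed by a Disallow for the directory, a first-match parser reads it exactly this way:

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of Example 1: allow one file, block the rest of the directory.
rp = RobotFileParser()
rp.parse(["User-agent: *",
          "Allow: /folder1/myfile.html",
          "Disallow: /folder1/"])

print(rp.can_fetch("*", "/folder1/myfile.html"))  # True  - explicitly allowed
print(rp.can_fetch("*", "/folder1/other.html"))   # False - caught by Disallow: /folder1/
print(rp.can_fetch("*", "/elsewhere.html"))       # True  - no rule matches, default is allow
```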
Example 2
I would assume the /norobots/ directory is off limits and everything else is available to be crawled. I read this as "crawl everything except the /norobots/ directory".
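Assuming Example 2 amounts to a single Disallow on that directory, the same parser confirms the "crawl everything except /norobots/" reading:

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of Example 2: one blocked directory, everything else open.
rp = RobotFileParser()
rp.parse(["User-agent: *",
          "Disallow: /norobots/"])

print(rp.can_fetch("*", "/norobots/index.html"))  # False - inside the blocked directory
print(rp.can_fetch("*", "/index.html"))           # True  - everything else is crawlable
```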
Example 3
Similar to Example 2, I would assume the /norobots/ directory is off limits and all .html files not in that directory are available to be crawled. I read this as "crawl all .html files but do not crawl any content in the /norobots/ directory".
Hopefully your bot's user-agent string contains a URL where site owners can find more information about your crawling habits, make removal requests, or give you feedback about how they want their robots.txt interpreted.