Prevent site data from being crawled and ripped

Between this:

What are the measures I can take to prevent malicious crawlers from ripping

and this:

I wouldn't want to block legitimate crawlers altogether.

you're asking for a lot. Fact is, if you're going to try and block malicious scrapers, you're going to end up blocking all the "good" crawlers too.

You have to remember that if people want to scrape your content, they're going to put in a lot more manual effort than a search engine bot will... So get your priorities right. You've two choices:

  1. Let the peasants of the internet steal your content. Keep an eye out for it (search Google for some of your more unique phrases) and send take-down requests to ISPs. This choice has barely any impact on you, apart from the time it takes.
  2. Use AJAX and rolling encryption to request all your content from the server. You'll need to keep the method changing, or even randomise it, so each page load carries a different encryption scheme. But even this will be cracked if somebody wants to crack it badly enough. You'll also drop off the face of the search engines and take a hit in traffic from real users. (A rough sketch of the idea follows this list.)
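To make option 2 concrete, here is a minimal browser-side sketch of the AJAX-plus-rolling-obfuscation idea. The `/api/content` endpoint, the element id, and the `{ key, payload }` response shape are assumptions made up for illustration; the server is assumed to XOR the UTF-8 bytes of the content with a per-request key and base64-encode the result. Note this only raises the bar slightly, since anyone can read the same script you ship to the browser and reverse it.

    // Sketch of option 2: fetch obfuscated content over AJAX and decode it
    // client-side. Endpoint and response shape are hypothetical, and the XOR
    // "encryption" is obfuscation only -- a determined ripper can run the
    // exact same code you ship to the browser.
    interface ContentResponse {
      key: string;      // per-request key chosen by the server
      payload: string;  // base64 of the content bytes XOR'd with the key bytes
    }

    async function loadContent(target: HTMLElement): Promise<void> {
      const res = await fetch("/api/content");        // hypothetical endpoint
      const { key, payload }: ContentResponse = await res.json();

      // Undo the per-request XOR and decode the UTF-8 bytes back to markup.
      const data = Uint8Array.from(atob(payload), c => c.charCodeAt(0));
      const keyBytes = new TextEncoder().encode(key);
      const plainBytes = data.map((b, i) => b ^ keyBytes[i % keyBytes.length]);
      target.innerHTML = new TextDecoder().decode(plainBytes);
    }

    // Assumes an element with id="content" in the page.
    loadContent(document.getElementById("content")!);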

Good crawlers will follow the rules you specify in your robots.txt; malicious ones will not. You can set up a "trap" for bad robots, as explained at http://www.fleiner.com/bots/ (a minimal sketch follows below).
But then again, if you put your content on the internet, I think it's better for everyone if it's as painless as possible to find (after all, you're posting here and not on some lame forum where experts exchange their opinions).
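If you want to experiment with the trap idea, here is a rough sketch assuming a Node/Express server; the /secret-trap/ path, the hidden link, and the in-memory IP set are all placeholders. Well-behaved crawlers honour the Disallow line, so anything that requests the trap path anyway gets remembered and refused real pages.

    // Sketch of a robots.txt "trap", assuming Express. /secret-trap/ is a
    // made-up path: robots.txt forbids it and only an invisible link points
    // at it, so the only visitors are bots that ignore both.
    import express from "express";

    const app = express();
    const badBots = new Set<string>();   // IPs that walked into the trap

    // Well-behaved crawlers read this and never touch the trap path.
    app.get("/robots.txt", (_req, res) => {
      res.type("text/plain").send("User-agent: *\nDisallow: /secret-trap/\n");
    });

    // Only rule-ignoring bots end up here; remember them.
    app.get("/secret-trap/", (req, res) => {
      badBots.add(req.ip ?? "unknown");
      res.status(403).send("Forbidden");
    });

    // Refuse real pages to anything already caught by the trap.
    app.use((req, res, next) => {
      if (badBots.has(req.ip ?? "unknown")) {
        res.status(403).end();
        return;
      }
      next();
    });

    // A normal page carrying an invisible link to the trap.
    app.get("/", (_req, res) => {
      res.send('<a href="/secret-trap/" style="display:none"></a>Real content here');
    });

    app.listen(3000);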


Realistically you can't stop malicious crawlers, and any measures you put in place to prevent them are likely to harm your legitimate users (aside from perhaps adding entries to robots.txt to allow detection).

So what you have to do is plan for the content being stolen - it's more than likely to happen in one form or another - and understand how you will deal with unauthorized copying.

Prevention isn't possible, and trying to make it so is a waste of your time.

The only sure way of making sure that the content on a website isn't vulnerable to copying is to unplug the network cable...

To detect copying, something like http://www.copyscape.com/ may help.
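For a rough do-it-yourself version of the same idea, you can periodically check suspect pages for distinctive phrases lifted from your own articles. The phrase list and URL below are placeholders; this is only a sketch, not a substitute for a real plagiarism-detection service.

    // DIY detection sketch: does a suspect page contain phrases that are
    // (hopefully) unique to your own content? Phrases and URL are placeholders.
    const uniquePhrases = [
      "an unmistakably distinctive sentence from one of your articles",
      "another phrase unlikely to appear anywhere else on the web",
    ];

    async function looksCopied(suspectUrl: string): Promise<boolean> {
      const html = (await (await fetch(suspectUrl)).text()).toLowerCase();
      return uniquePhrases.some(p => html.includes(p.toLowerCase()));
    }

    looksCopied("https://example.com/some-suspect-page")
      .then(copied => console.log(copied ? "possible rip detected" : "no match"));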


Any site that is visible to human eyes is, in theory, potentially rippable. If you're even going to try to be accessible then this, by definition, must be the case (how else would speaking browsers deliver your content if it isn't machine readable?).

Your best bet is to look into watermarking your content, so that at least if it does get ripped you can point to the watermarks and claim ownership.
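One simple (and easily defeated) way to do that for plain text is to hide a per-copy ID in zero-width characters; if a ripped copy preserves them, you can decode the ID later and point to it. The character choices and the 16-bit ID below are just illustrative assumptions.

    // Watermark sketch: encode a 16-bit ID as invisible zero-width characters
    // prepended to the text. Trivial to strip if the ripper knows to look,
    // but enough to demonstrate the general idea.
    const ZERO = "\u200B"; // zero-width space      -> bit 0
    const ONE  = "\u200C"; // zero-width non-joiner -> bit 1

    function embedWatermark(text: string, id: number): string {
      const bits = id.toString(2).padStart(16, "0");
      return [...bits].map(b => (b === "0" ? ZERO : ONE)).join("") + text;
    }

    function extractWatermark(text: string): number | null {
      const marks = [...text].filter(c => c === ZERO || c === ONE).slice(0, 16);
      if (marks.length < 16) return null;
      return parseInt(marks.map(c => (c === ZERO ? "0" : "1")).join(""), 2);
    }

    const marked = embedWatermark("Your article text goes here.", 42);
    console.log(extractWatermark(marked)); // 42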