How does a company like CloudFlare block bot crawling and email harvesters?

CloudFlare sits as a guard between your webserver and the client. All content the client receives is produced by your webserver and then filtered by CloudFlare. This allows CloudFlare to obfuscate email addresses: it matches them with a regex and rewrites them before delivering the page to the client.

If your website contains the email

<a href="mailto:[email protected]">[email protected]</a>

CloudFlare will replace it with

<a href="/cdn-cgi/l/email-protection#fed8ddcfcfcbc5d8ddc8cac5d8ddcfcfcbc5d8ddc7c7c5d8ddcfcecac5d8ddc7c9c5d8ddcac8c5d8ddc7c6c5d8ddcfccccc5">&#115;&#64;&#115;&#99;&#104;&#97;&#46;&#98;&#122;</a>

The /cdn-cgi/ folder, although it still appears to point to your webserver, is reserved for CloudFlare, which automatically intercepts requests to it, deobfuscates the encoded value and returns the correct email address.
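
To make the transformation concrete, here is a minimal Python sketch of the scheme the encoded href appears to use: the first hex byte acts as an XOR key, and every following byte is one character of the address XORed with that key. This is an illustration based on observed behaviour, not CloudFlare's documented implementation; the function names and the example address are made up.

    def encode_cf_email(address: str, key: int = 0x42) -> str:
        """Obfuscate an address the way the rewritten href appears to."""
        return f"{key:02x}" + "".join(f"{ord(c) ^ key:02x}" for c in address)

    def decode_cf_email(encoded: str) -> str:
        """Reverse the encoding: the first hex byte is the key, the rest is XORed data."""
        key = int(encoded[:2], 16)
        return "".join(
            chr(int(encoded[i:i + 2], 16) ^ key)
            for i in range(2, len(encoded), 2)
        )

    # Hypothetical address, used purely for illustration:
    obfuscated = encode_cf_email("user@example.com")
    assert decode_cf_email(obfuscated) == "user@example.com"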

Of course this is not bulletproof (that is simply not possible), as a bot can follow that URL or search for encoded email patterns. In practice this is rare, and most of today's simple crawlers won't find your email address.

You shouldn't rely on this approach alone: CloudFlare is already quite popular, so its obfuscated addresses are easy to detect and deobfuscate. Your own, unique obfuscation technique is more likely to hold up against intelligent harvesters, because adapting a crawler to every single obfuscation scheme is too much work.
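
As a trivial example of rolling your own scheme, the sketch below (assumed names and address; Python is used only to generate the markup) emits the address as HTML numeric character references: a browser renders it normally, but a naive plain-text regex for email addresses won't match it. A determined harvester can of course still decode it.

    def entity_encode(address: str) -> str:
        """Turn each character into an HTML numeric character reference."""
        return "".join(f"&#{ord(c)};" for c in address)

    # Hypothetical address, for illustration only:
    link = f'<a href="mailto:{entity_encode("user@example.com")}">contact</a>'
    print(link)  # &#117;&#115;&#101;&#114;&#64;... renders as the address in a browser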


Simple bot behaviour and "normal user" behaviour are noticeably different, and most bots tend to be relatively simple, since that works for the majority of sites. For example, consider arriving on Security.SE:

  • A human loads the page, then there is a delay of a few seconds or more whilst they read the first few questions; then you get a request for a page, followed by browser-initiated requests for supporting files (images, scripts, styles). You would then expect a bit of time to pass before a request with that page as the referrer comes in for another page. A more technical user might open several questions at once if they're using a tabbed browser, but there will be a short pause between these requests (whilst they move the mouse or tab to the next question), and again you expect a pause before any further manual requests from those pages.
  • A bot loads the page and immediately parses it, looking for links and email addresses. You see a large number of requests almost immediately after the page has been sent. Depending on the bot, you may find that supporting files aren't loaded (the bot doesn't care about your styling). The bot then typically does the same with links from the pages it receives, and keeps going until it can't find any more links. A toy version of this heuristic is sketched below.
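
To illustrate the kind of heuristic this implies, here is a toy Python sketch that flags a client which requests many pages with near-zero gaps and never loads supporting assets. The class, field names and thresholds are invented for the example; a real service combines far more signals than this.

    from dataclasses import dataclass

    @dataclass
    class Request:
        timestamp: float      # seconds since the session started
        path: str
        is_asset: bool        # images, scripts, stylesheets, ...

    def looks_like_bot(requests: list[Request],
                       min_gap: float = 2.0,
                       max_fast_pages: int = 3) -> bool:
        """Toy heuristic: many page requests with sub-min_gap pauses
        and no supporting assets loaded at all."""
        pages = [r for r in requests if not r.is_asset]
        fetched_assets = any(r.is_asset for r in requests)
        fast_pages = sum(
            1 for prev, cur in zip(pages, pages[1:])
            if cur.timestamp - prev.timestamp < min_gap
        )
        return not fetched_assets and fast_pages >= max_fast_pages

    # A crawler hammering links without ever loading CSS or JS scores as a bot:
    burst = [Request(i * 0.2, f"/q/{i}", False) for i in range(10)]
    print(looks_like_bot(burst))  # True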

These methods can be bypassed with a bit of effort to make a bot look like a human, but that would slow down the crawling process considerably, so dubious bot owners don't seem to bother.


In addition to James' and Matthew's answers (which both make valid points, by the way):

Obviously services like CloudFlare have a bunch of detection methods to decide whether or not a client is allowed through their various layers of protection.

They have a lot of information on their website about these features but you probably won't find specific rules and implementation details as this would make detection easier to circumvent.

I guess you could tell that it is some kind of bot because you get requests for multiple sites from the same IP, but that would also be the case for many legitimate IPs, such as a VPN. How do you tell the good from the bad?

Anecdotal: I'm indeed often deemed 'suspicious' by CloudFlare whilst connected to a VPN.

I suspect a lot of the factors Matthew mentioned (load time, type of resources requested, pauses before the next request) contribute to CloudFlare not instantly blocking me.
Instead it serves up Google's reCAPTCHA to confirm I am not a bot/crawler and lets me through afterwards.

More info:
James' answer: E-mail obfuscation
Matthew's answer: Web Application Firewall/WAF