Facebook crawler with no user agent spamming our site in possible DoS attack

Sources say that Facebook's scraper (user agent facebookexternalhit) does not respect Crawl-delay in robots.txt because Facebook doesn't use a crawler; it uses a scraper.

Whenever one of your pages is shared on Facebook, it scrapes that page for your meta title, description, and image.

My guess: if Facebook is scraping your site 11,000 times in 15 minutes, the most likely scenario is that someone has figured out how to abuse the Facebook scraper to DDoS your site.

Perhaps they are running a bot that clicks your share link over and over, and Facebook scrapes your page every time it does.

Off the top of my head, the first thing I would try is caching the pages that Facebook is scraping. You can do this with expiry headers in .htaccess, which will hopefully tell Facebook not to re-fetch your page on every share until the cache expires.

Because of your issue, I would set the HTML expiry longer than usual.

In .htaccess:

<IfModule mod_expires.c>
  ExpiresActive On
  ExpiresDefault "access plus 60 seconds"
  ExpiresByType text/html "access plus 900 seconds"
</IfModule>

Setting HTML to expire after 900 seconds will hopefully keep Facebook from fetching any individual page more than once every 15 minutes.
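If mod_expires isn't available on your host, mod_headers can set a roughly equivalent Cache-Control header. This is a minimal sketch assuming your HTML pages end in .html or .htm; adjust the pattern to whatever your URLs actually look like:

<IfModule mod_headers.c>
  # Ask clients (including Facebook's scraper, if it honors caching)
  # to reuse HTML responses for 15 minutes
  <FilesMatch "\.(html|htm)$">
    Header set Cache-Control "max-age=900, public"
  </FilesMatch>
</IfModule>

Either approach only helps if the scraper actually honors caching headers, so treat it as a first line of defense rather than a guaranteed fix.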


Edit: I ran a quick search and found a page written a few years ago that discusses the very issue you're encountering. The author discovered that websites could be flooded by the Facebook scraper through its share feature and reported it to Facebook, but they chose to do nothing about it. The article may make it clearer what is happening to you and point you in the right direction for how you'd like to deal with the situation:

http://chr13.com/2014/04/20/using-facebook-notes-to-ddos-any-website/


https://developers.facebook.com/bugs/1894024420610804

Per the answer from Facebook on that bug report, anyone whose content is shared on Facebook should expect the Facebook crawlers to generate traffic at 10-20x the number of shares.

This sounds like Facebook is scraping the content every single time it's accessed, with little to no caching in place.

In our case, while Facebook is probably good for advertising overall, this is an immense strain when a database-intensive page gets shared. We'll have to rate-limit the traffic on our end to prevent a denial of service: a resource-intensive answer to Facebook's overactive bot.
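For anyone hitting the same wall, one way to take the load off a database-intensive page is to hand the scraper a pre-rendered static copy instead of the live page. Below is a minimal mod_rewrite sketch in .htaccess, assuming the requests carry the facebookexternalhit user agent (the question notes some arrive with none, which this won't catch) and that /snapshots/heavy-page.html is a hypothetical static copy you regenerate periodically:

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Divert Facebook's scraper to a static snapshot so each share
  # doesn't trigger a database-heavy render (paths are hypothetical)
  RewriteCond %{HTTP_USER_AGENT} facebookexternalhit [NC]
  RewriteRule ^heavy-page$ /snapshots/heavy-page.html [L]
</IfModule>

The snapshot still carries the meta title, description, and image that Facebook is after, so shares keep working while the database stays out of the hot path.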