How to recognize bots with PHP?

You can check the User-Agent string: an empty string, or one containing 'robot', 'spider', 'crawler', or 'curl', is likely to come from a robot.

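// The header may be absent, so fall back to an empty string (which the ^$ branch matches).
$isBot = (bool) preg_match('/robot|spider|crawler|curl|^$/i', $_SERVER['HTTP_USER_AGENT'] ?? '');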
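
For reuse, the same check can be wrapped in a small function. This is only a sketch: the name isLikelyBot() is made up for illustration, and the pattern is the one shown above, so it will miss any bot whose user agent contains none of those keywords.

```php
<?php
// Sketch only: isLikelyBot() is a hypothetical helper name, not a PHP built-in.
function isLikelyBot(?string $userAgent): bool
{
    // Treat a missing header the same as an empty string.
    $userAgent = trim($userAgent ?? '');

    // Same pattern as above: an empty UA, or one of the telltale keywords.
    return preg_match('/robot|spider|crawler|curl|^$/i', $userAgent) === 1;
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? null;
if (!isLikelyBot($userAgent)) {
    // Probably a human visitor: update your statistics here.
}
```

Comparing the result with === 1 keeps a regex error (preg_match returning false) from being counted as a match.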


You should filter by user-agent string. The database at http://www.robotstxt.org/db.html lists about 300 common user agents sent by bots. Running through that list and skipping bot user agents before you run your SQL statement should solve your problem for all practical purposes.
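
A rough sketch of that check might look like the following. The function name matchesBotList() and the three signatures are placeholders chosen for illustration, not entries copied from that database; in practice you would build the list from it.

```php
<?php
// Hypothetical, hand-maintained list. These entries are illustrative only.
$botSignatures = ['Googlebot', 'Slurp', 'bingbot'];

// Sketch only: matchesBotList() is a made-up helper name.
function matchesBotList(string $userAgent, array $signatures): bool
{
    foreach ($signatures as $signature) {
        // stripos() is case-insensitive; compare with !== false because
        // a match at offset 0 is still a match.
        if ($signature !== '' && stripos($userAgent, $signature) !== false) {
            return true;
        }
    }
    return false;
}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!matchesBotList($userAgent, $botSignatures)) {
    // Not a listed bot: run the SQL statement that records the visit.
}
```

A plain substring scan is enough for a few hundred entries; an empty user agent is not treated as a bot here, so combine it with an empty-string check if you want that behaviour.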

If you don't want the search engines to even reach the page, use a basic robots.txt file to block them.


We have a similar use case to yours, and one option we've recently found quite helpful is the UASParser class from user-agent-string.info.

It's a PHP class which pulls the latest set of user agent string definitions and caches them locally. The class can be configured to pull the definitions as often or as rarely as you deem fit. Automatically fetching them like this means that you don't have to keep on top of the various changes to bot user agents or new ones coming on the market, although you are relying on UAS.info to do this accurately.

When the class is called, it parses the current visitor's user agent and returns an associative array breaking out the constituent parts, e.g.

Array
(
    [typ] => browser
    [ua_family] => Firefox
    [ua_name] => Firefox 3.0.8
    [ua_url] => http://www.mozilla.org/products/firefox/
    [ua_company] => Mozilla Foundation
    ........
    [os_company] => Microsoft Corporation.
    [os_company_url] => http://www.microsoft.com/
    [os_icon] => windowsxp.png
)

The field typ is set to browser when the UA is identified as likely belonging to a human visitor, in which case you can update your stats.
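
As a sketch of that last step, assuming you already hold an array shaped like the example output above (how it is obtained from the parser class is not shown, and isHumanVisit() is a made-up helper name):

```php
<?php
// Sketch only: $parsed is assumed to look like the example array above.
function isHumanVisit(array $parsed): bool
{
    // A missing 'typ' key is treated as "not a browser". The comparison is
    // case-insensitive as a precaution; that is an assumption, not documented behaviour.
    return strcasecmp((string) ($parsed['typ'] ?? ''), 'browser') === 0;
}

// Example input mirroring a few fields from the output shown above.
$parsed = ['typ' => 'browser', 'ua_family' => 'Firefox', 'ua_name' => 'Firefox 3.0.8'];
if (isHumanVisit($parsed)) {
    // Identified as a browser: update your stats here.
}
```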

A couple of caveats here:

  • You're relying on UAS.info to keep its user agent definitions accurate and up to date.
  • Bots like Google and Yahoo declare themselves in their user agent strings, but this method will still count visits from bots pretending to be human visitors (sending spoofed UAs).
  • As @amdfan mentioned above, blocking bots via robots.txt should stop most of them from reaching your page. If you need the content to be indexed without incrementing your stats, the robots.txt method won't be a realistic option.