Filter Comment Spam? PHP

When writing your own method, you'll have to employ a combination of heuristics.

For example, it's very common for spam comments to have 2 or more URL links.
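As a rough sketch of that link heuristic, something like the following would count the URLs in a comment. The function names `countLinks` and `hasTooManyLinks` and the default limit of one link are my own assumptions, not part of any library:

```php
<?php
// Hypothetical helper: counts http:// and https:// links in a comment.
function countLinks(string $text): int
{
    // preg_match_all() returns the number of full matches, or false on error.
    $matches = preg_match_all('~https?://\S+~i', $text);
    return $matches === false ? 0 : $matches;
}

// Hypothetical rule of thumb: more than $maxLinks links looks suspicious.
function hasTooManyLinks(string $text, int $maxLinks = 1): bool
{
    return countLinks($text) > $maxLinks;
}
```

With the default limit, a comment with a single link passes and one with two or more is treated as suspicious. You'd probably want to feed the count into your overall score rather than reject on it alone.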

I'd begin writing your filter like so: keep a dictionary of trigger words, loop through it, and use the number of matches to estimate the probability that the comment is spam:

function spamProbability($text){
    $probability = 0;  
    $text = strtolower($text); // lowercase it so the matching is case-insensitive
    $myDict = array("http","penis","pills","sale","cheapest"); 
    foreach($myDict as $word){
        $count = substr_count($text, $word);
        $probability += .2 * $count;
    }
    return $probability;
}

Note that this method will result in many false positives, depending on your word set, and that the value it returns is a score rather than a true probability (it can go above 1). Instead of rejecting on a single cut-off, you could have your site "flag" comments scoring from .3 up to .6 for moderation (they still go live immediately), send those from .6 up to .9 into a moderation queue (where they don't appear until approved), and simply reject anything at .9 or above.
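The tiers could be wired up roughly like this. It's only a sketch: `moderationAction` is a hypothetical name, and treating each cut-off (.3, .6, .9) as an inclusive lower bound is my assumption so that every score lands in exactly one tier:

```php
<?php
// Hypothetical helper: maps a spam score to a moderation action.
// The cut-offs are starting points to tweak, not tuned values.
function moderationAction(float $score): string
{
    if ($score >= 0.9) {
        return 'reject';   // never published
    }
    if ($score >= 0.6) {
        return 'queue';    // hidden until a moderator approves it
    }
    if ($score >= 0.3) {
        return 'flag';     // published, but marked for review
    }
    return 'publish';      // published with no flag
}
```

Checking from the highest tier down means each comment gets exactly one action and there are no gaps between the ranges.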

Obviously you'll have to tweak all of these thresholds, but this should start you off with a pretty basic system. You can add several other qualifiers that increase or decrease the probability of spam, such as checking the ratio of bad words to total words, giving different words different weights, etc.
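Here's one way those two refinements might look together. Treat it as a sketch: `weightedSpamScore` is a made-up name, the weights are illustrative guesses rather than measured values, and simply adding the trigger-word ratio to the score is just one possible way to combine them:

```php
<?php
// Hypothetical variant: per-word weights plus a trigger-word ratio.
function weightedSpamScore(string $text): float
{
    // Weights are illustrative assumptions, not measured values.
    $weights = array("http" => 0.3, "pills" => 0.4, "cheapest" => 0.2, "sale" => 0.1);
    $text = strtolower($text); // lowercase so the matching is case-insensitive
    $wordCount = max(1, str_word_count($text)); // avoid division by zero

    $score = 0.0;
    $hits = 0;
    foreach ($weights as $word => $weight) {
        $count = substr_count($text, $word);
        $hits += $count;
        $score += $weight * $count;
    }

    // Add the share of trigger words so short, dense spam scores higher.
    return $score + $hits / $wordCount;
}
```

The ratio term means a two-word comment made up entirely of trigger words scores higher than a long comment that happens to mention the same two words. Any thresholds would need re-tuning for this scale.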


I'm surprised no one mentioned Akismet. I've never had a message marked wrong (be it spam or legit). My WordPress install came with it. All I had to do was hit enable.