How does Bayesian poisoning work?
There's a good paper published named Bachelor thesis:The Effects of Different Bayesian Poison Methods on the Quality of the Bayesian Spam Filter ‘SpamBayes’ by Martijn Sprengers.
I'll try to make TL;DR:
Bayesian spamfilters try to decide if an email is spam or not by looking at keywords in an email. What it does is review the words present in normal and spam email and update the scores for each word. These scores are used to deduce if an email is spam or not by making a score based on the overal score of words present in the email.
Words are re-scored, meaning that if "Viagra" appears in several normal emails, it will get a lower score over time. This is abused by spammers by generating email with several low scoring words, commonly found in legitimate emails and adding a single bad word. Because the score of the email will overall be considered good "Viagra" will get a lower score over time making it a legitimate word, and causing spam email to pass through spam filters.
There are three attacks the paper discusses:
Random Words: This attack method is based on the research by Gregory et al. [6]. It can be seen as a weak statistical attack, because it uses purely randomized data to add to the spam e-mails.
Common Words: This attack method is based on the research by Stern et al. [7]. They added common English words to spam e-mails in order to confuse the spam filter. This attack can be seen as stronger statistical attack than the Random Words method, because the data used is less random and it contains words that are more likely to be in e-mails than the words added with the previous attack.
Ham Phrases: This attack is developed in this research and tested against the other two. It is based on a huge collection of ham e-mails. From that collection, only the ham e-mails with the lowest combined probability are used as poison. The ham e-mail is then added at the end of the original spam e-mail. Most people read downwards, so the effectiveness of the message is maintained. This is also a strong statistical attack, maybe even stronger than the Common Words attack, because the words are even less randomized.
Highlights from the paper's conclusion:
From a spammer’s point of view, the ‘HamPhrases’ technique seems to work best. It does decrease the performance of the spam filter. … The ‘Random’ and ‘Common Words’ techniques seem to score worse from a spammers point of view. … When we train the spam filter on those poison methods, the performance gets even better than normal. …
However, the HamPhrases method used in this research is a little bit cheating. This is because both ham and spam e-mails that the spam filter uses for testing and training are available for the algorithm. Real spammers do not have the ham e-mails of real users.
Lucas Kauffman answer explain the how very well, as for why:
If the user is failing to receive important emails and it turns out they got caught in the spam filter then they'll get angry at their admin. False positives can have a very high cost.
When a lot of users get angry at the admin the admin is likely to change things so that the spam filter is more forgiving which is likely to end up letting more spam through which is good for the spammers.
I have a great example of a spam message with Bayesian poisoning in an old blog post.
Bayesian spam filters basically keep track of each word used in each message. When a message is marked as spam, the filter treats the words in the message as representative of spam. By using this information, the filter can determine with good accuracy whether a particular message is spam or not.
However, the fact that Bayesian filters use the words in each message to determine whether a message is spam makes it susceptible to techniques that circumvent this process.
A spam message can insert nonsense words, break the words apart in a human-readable (but not machine-readable) fashion (e.g. insert "invisible" small letters between each letter in the spammy word), use accent marks or HTML entities to make it harder to distinguish by filters, or use HTML forms in place of links. This is essentially what Bayesian poisoning is, and all of these techniques are demonstrated and explained in my blog post.
In particular, the "nonsense words" can be carefully chosen to be those commonly found in normal messages. A user marking a spam message containing these words as spam is essentially telling the filter to treat them as an indication of spam. With enough such messages, the filter will think that these words represent spam and begin to mark legitimate messages containing these words as such.
The first image in the blog post demonstrates how this is done:
View full size
Although the full sentences don't make a lot of sense, they look somewhat coherent. "Smiling at that", "God knew he waited", and "Behind the bed" are all phrases and words which can appear in normal messages. If these kinds of phrases appear often enough in spam messages and the user marks them as spam, the filter could end up thinking legitimate messages with these phrases are spam.