Bayesian spam filtering

Bayesian spam filtering (after Rev. Thomas Bayes) is a statistical technique of e-mail filtering. It makes use of a naive Bayes classifier to identify spam e-mail.


Depending on the implementation, Bayesian spam filtering may be susceptible to Bayesian poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email’s spam score, making it more likely to slip past a Bayesian spam filter.
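To make the poisoning effect concrete, here is a minimal sketch of a naive Bayes combination that uses every token in the message. The per-token probabilities are hypothetical values chosen for illustration, and the function name `combined_spam_probability` is an assumption, not part of any particular filter.

```python
from math import prod

def combined_spam_probability(token_probs):
    """Combine per-token spam probabilities, assuming token independence
    and equal prior probabilities for spam and non-spam."""
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)

# Hypothetical per-token spam probabilities (not measured data).
spammy_tokens = [0.99, 0.95, 0.90]
innocuous_padding = [0.20] * 10  # ten words rarely seen in spam

original_score = combined_spam_probability(spammy_tokens)
padded_score = combined_spam_probability(spammy_tokens + innocuous_padding)

print(f"without padding: {original_score:.4f}")
print(f"with padding:    {padded_score:.4f}")
```

Under these assumed numbers, the padded message scores far lower than the unpadded one, which is the effect the spammer is hoping for when every word contributes to the score.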

However, with Paul Graham’s scheme, for example, only the most significant probabilities are used, so padding the text with non-spam-related words does not significantly affect the detection probability.
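The idea of using only the most significant probabilities can be sketched as follows. This is an illustration under stated assumptions, not Graham’s actual code: "most significant" is taken to mean farthest from the neutral value 0.5, the default of 15 tokens is an assumed cutoff, and all probabilities are hypothetical.

```python
from math import prod

def most_significant(token_probs, n=15):
    """Keep the n probabilities farthest from the neutral value 0.5."""
    return sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]

def top_n_spam_score(token_probs, n=15):
    """Naive Bayes combination restricted to the n most significant tokens."""
    top = most_significant(token_probs, n)
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

# Hypothetical per-token spam probabilities (not measured data).
spammy_tokens = [0.99] * 15
weak_padding = [0.20] * 100  # many mildly innocent words

score_plain = top_n_spam_score(spammy_tokens)
score_padded = top_n_spam_score(spammy_tokens + weak_padding)

print(score_plain, score_padded)
```

In this sketch the padding words never enter the calculation, because each is closer to 0.5 than the strongly spammy tokens. Padding words with probabilities very close to 0 would still compete for the top slots, so the protection depends on how extreme the inserted words are.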

Words that normally appear in large quantities in spam may also be transformed by spammers. For example, “Viagra” would be replaced with “Viaagra” or “V!agra” in the spam message. The recipient of the message can still read the changed words, but the Bayesian filter encounters each of these variants more rarely, which hinders its learning process.

As a general rule, this spamming technique does not work very well, because the derived words end up being recognized by the filter just like the normal ones.
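A small sketch shows why altered spellings stop helping once the filter has seen them in training. The training messages, the whitespace tokenizer, and the simple frequency-ratio estimate are all assumptions made for illustration; real filters use larger corpora and smoothed estimates.

```python
from collections import Counter

def train(messages):
    """Count, for each token, the number of messages that contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate the spam probability of a token from message frequencies."""
    s = spam_counts[token] / n_spam
    h = ham_counts[token] / n_ham
    if s + h == 0:
        return 0.5  # token never seen: treat as neutral
    return s / (s + h)

# Hypothetical training data.
spam = ["buy v!agra now", "cheap v!agra online", "v!agra discount today"]
ham = ["meeting moved to today", "lunch now or later", "see you online"]

spam_counts, ham_counts = train(spam), train(ham)

p_variant = token_spam_probability("v!agra", spam_counts, ham_counts, len(spam), len(ham))
p_unseen = token_spam_probability("viaagra", spam_counts, ham_counts, len(spam), len(ham))

print(p_variant, p_unseen)
```

Here the variant that appeared in the training spam is treated as strongly spammy, while a variant the filter has never seen stays neutral until messages containing it are classified and fed back into training.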

Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is “drawn”. The spam filter is usually unable to analyze this picture, which would contain the sensitive words like “Viagra”.

However, since many mail clients disable the display of linked pictures for security reasons, a spammer sending links to remote pictures might reach fewer targets. Also, a picture’s size in bytes is bigger than the equivalent text’s size, so the spammer needs more bandwidth to send messages that directly include pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. Finally, a probably more effective solution has been proposed by Google and is used by its Gmail email system: performing OCR (optical character recognition) on every mid- to large-size image and analyzing the text inside.

— Wikipedia on Bayesian spam filtering

2012.06.03 Sunday ACHK