Bayesian spam filtering

Posted on June 3, 2012 by rpflee

Bayesian spam filtering (after Rev. Thomas Bayes) is a statistical technique of e-mail filtering. It makes use of a naive Bayes classifier to identify spam e-mail.
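To make the idea concrete, here is a minimal sketch of a naive Bayes spam classifier. It is an illustration only, not the code of any particular filter: the function names, the add-one smoothing, and the tiny training set are all assumptions chosen for brevity.

```python
import math
from collections import Counter

def train(messages):
    """Count word occurrences per class from (text, is_spam) pairs."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_probability(text, counts, totals):
    """Posterior P(spam | words) under the naive independence assumption."""
    vocab = set(counts[True]) | set(counts[False])
    log_score = {}
    for label in (True, False):
        # Log prior: fraction of training messages carrying this label.
        score = math.log(totals[label] / (totals[True] + totals[False]))
        n_words = sum(counts[label].values())
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the product.
            score += math.log((counts[label][word] + 1) / (n_words + len(vocab)))
        log_score[label] = score
    # Normalise the two log scores into a probability.
    m = max(log_score.values())
    spam = math.exp(log_score[True] - m)
    ham = math.exp(log_score[False] - m)
    return spam / (spam + ham)

# Hypothetical toy training data, for illustration only.
TRAINING = [
    ("buy cheap viagra now", True),
    ("cheap pills buy now", True),
    ("meeting agenda for monday", False),
    ("lunch on monday with the team", False),
]
counts, totals = train(TRAINING)
print(spam_probability("buy cheap pills", counts, totals))   # above 0.5
print(spam_probability("monday meeting", counts, totals))    # below 0.5
```

A real filter would train on thousands of messages and tokenize far more carefully (headers, URLs, punctuation), but the structure is the same.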

Disadvantages

Depending on the implementation, Bayesian spam filtering may be susceptible to Bayesian poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email’s spam score, making it more likely to slip past a Bayesian spam filter.

However, with (for example) Paul Graham’s scheme, only the most significant probabilities are used, so padding the text with non-spam-related words does not significantly affect the detection probability.
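The effect can be sketched as follows. This is a simplified illustration in the spirit of Graham’s scheme, not his actual code: the function name, the cutoff of 15 words, and all per-word probabilities are assumptions. “Significant” is taken to mean farthest from the neutral value 0.5.

```python
from math import prod

def combined_spam_probability(word_probs, n_significant=15):
    """Combine only the n most significant per-word spam probabilities."""
    # Keep the probabilities farthest from the neutral value 0.5.
    top = sorted(word_probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_significant]
    spam = prod(top)
    ham = prod(1.0 - p for p in top)
    return spam / (spam + ham)

# Hypothetical per-word probabilities for a spam message ...
spam_words = [0.99] * 15
# ... and the same message padded with 200 mildly innocent words.
padded = spam_words + [0.4] * 200

print(combined_spam_probability(spam_words))
print(combined_spam_probability(padded))  # unchanged: the padding never enters the top 15
print(combined_spam_probability(padded, n_significant=len(padded)))  # using every word
```

In this sketch the padding only helps the spammer if every word is counted. Note the limit of the argument: padding words whose probabilities are themselves extreme (very close to 0) would enter the top 15 and could still shift the score.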

Words that normally appear in large quantities in spam may also be transformed by spammers. For example, “Viagra” would be replaced with “Viaagra” or “V!agra” in the spam message. The recipient of the message can still read the changed words, but each of these words is met more rarely by the Bayesian filter, which hinders its learning process.

As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones.
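A small sketch shows why. Once a few spam messages containing a variant have been seen, the variant earns a high spam probability of its own. The counts below are hypothetical, and the simple frequency ratio is an assumption used for illustration rather than the formula of any specific filter.

```python
def word_spam_probability(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | word) from per-class message frequencies."""
    spam_freq = spam_counts.get(word, 0) / n_spam
    ham_freq = ham_counts.get(word, 0) / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # never seen: treated as neutral (an assumption)
    return spam_freq / (spam_freq + ham_freq)

# Hypothetical numbers of training messages containing each token.
spam_counts = {"viagra": 120, "v!agra": 35, "meeting": 2}
ham_counts = {"viagra": 1, "meeting": 90}
N_SPAM, N_HAM = 200, 200

for token in ("viagra", "v!agra", "meeting", "unseen"):
    print(token, word_spam_probability(token, spam_counts, ham_counts, N_SPAM, N_HAM))
```

Because “v!agra” almost never appears in legitimate mail, it can end up as an even stronger spam signal than the original spelling.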

Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is “drawn”. The spam filter is usually unable to analyze this picture, which would contain the sensitive words like “Viagra”.

However, since many mail clients disable the display of linked pictures for security reasons, the spammer sending links to distant pictures might reach fewer targets. Also, a picture’s size in bytes is bigger than the equivalent text’s size, so the spammer needs more bandwidth to send messages directly including pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. Finally, a probably more efficient solution has been proposed by Google and is used by its Gmail email system: performing OCR (optical character recognition) on every mid- to large-size image and analyzing the text inside.

— Wikipedia on Bayesian spam filtering

2012.06.03 Sunday ACHK

The why of love, 2.1

Posted on June 3, 2012 by rpflee

Soft and Hard Intelligence 7.1

How to answer this kind of question:

Why am I so stupid?

It is not a valid question.

It is not the case that there is a pre-existing “I”, to which we can assign some qualities such as stupidity.

Instead, I am the sum of all my qualities, including the quality of being stupid.

— Me@2011.10.18

Defensive Strategies for Dealing with the World (應世守略) 3.2

Posted on June 3, 2012 by rpflee

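This passage is adapted from a conversation on March 20, 2010.

(An: So is there any way to stop being afraid of loneliness?)

According to the philosopher Schopenhauer, people are like “hedgehogs in the depths of winter”:

On the one hand, they need to move close to one another to keep warm; on the other hand, if they get too close, they prick one another.

That is a dilemma.

However, besides the “common hedgehogs”, there are also some “supreme star-class hedgehogs”. Each of them has a “nuclear reactor” inside its body and can generate its own heat and warmth. So each of them has no fear of cold or loneliness, and has no need to move close to other hedgehogs.

That dissolves the dilemma.

A “supreme star-class hedgehog” will occasionally approach other hedgehogs. That is not because it is “forced” to or “needs” to, but because it “wishes” to or “enjoys” it. If it visits other “supreme star-class hedgehogs”, it is because it wants to exchange “warmth and goodwill” with them; if it visits “common hedgehogs”, it is because it wants to spread “warmth and goodwill” among them.

By the same token, the richer your mental world, the less you “need” to see other people. On the one hand, you will “wish” (rather than “need”) to spend time with others; on the other hand, you will also greatly enjoy spending time with yourself.

Conversely, the duller your inner life, the more you need to join group activities to distract yourself and to avoid, as far as you can, spending time with yourself.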

— Me@2012.06.03


Physics Town

continues

Daily Archives: June 3, 2012
