
Bayesian spam filtering

Bayesian spam filtering is a method of filtering spam based on a naive Bayes classifier, which in turn relies on the direct application of Bayes' theorem. The theorem is named after its author, Thomas Bayes (1702-1761), an English mathematician and clergyman who was the first to use it to update beliefs in the light of new data.


History

The first known email filtering program to use a Bayesian classifier was Jason Rennie's iFile, released in 1996; the program sorted mail into folders [1]. The first academic publication on naive Bayesian spam filtering appeared in 1998 [2], and work on commercial spam filters began soon afterwards. However, it was only in 2002 that Paul Graham managed to reduce the false-positive rate to the point where a Bayesian filter could be used as the sole spam filter [3] [4] [5].

Modifications of the basic approach have been developed in many research papers and implemented in software products [6]. Many modern email clients perform Bayesian spam filtering, and users can also install separate mail-filtering programs. Server-side filters such as DSPAM, SpamAssassin, SpamBayes, SpamProbe, Bogofilter and CRM114 use Bayesian spam filtering methods [5]. Email server software either ships with such filters built in or provides an API for plugging in external modules.

Description

During training, a "weight" is computed and stored for each word found in the messages: an estimate of the probability that a message containing this word is spam. In the simplest case the frequency "occurrences in spam / occurrences in all messages" is used as the estimate. In more complex cases the text may be pre-processed: words are reduced to their base form, stop words are removed, "weights" are computed for whole phrases, the text is transliterated, and so on.
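As an illustration, a minimal training step might look as follows. This is only a sketch in Python; the tokenizer, the variable names and the corpus format are assumptions made for the example and are not taken from any particular filter.

```python
from collections import Counter
import re

def tokenize(text):
    # Naive tokenizer: lowercase words only. A real filter may also stem words,
    # drop stop words, or transliterate, as described above.
    return re.findall(r"[a-z']+", text.lower())

def train(spam_messages, ham_messages):
    """Return a dict word -> estimated P(spam | word), the word's "weight"."""
    spam_counts = Counter(w for m in spam_messages for w in tokenize(m))
    ham_counts = Counter(w for m in ham_messages for w in tokenize(m))
    weights = {}
    for word in set(spam_counts) | set(ham_counts):
        in_spam = spam_counts[word]
        in_all = in_spam + ham_counts[word]
        weights[word] = in_spam / in_all  # occurrences in spam / occurrences in all
    return weights
```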

When a new message is checked, the probability of "spam" is computed using Bayes' formula over many hypotheses. Here the "hypotheses" are words: for each word the "reliability of the hypothesis" is $P(A_i) = N_{\text{word}_i} / N_{\text{words total}}$, the proportion of this word in the message, while the "dependence of the event on the hypothesis" $P(B \mid A_i)$ is the previously computed "weight" of the word. In other words, the "weight" of the message in this case is the average "weight" of all its words.

A message is classified as "spam" or "not spam" according to whether its "weight" exceeds a threshold set by the user (typically 60-80%). After a decision has been made about a message, the "weights" of the words it contains are updated in the database.
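Continuing the sketch above (and reusing its tokenize function and weights dictionary), a classifier in the spirit of this description could average the stored weights of the words it knows and compare the result with a user-chosen threshold; the value 0.7 below is an arbitrary point inside the 60-80% range mentioned above.

```python
def classify(message, weights, threshold=0.7):
    """Spam if the average "weight" of the known words exceeds the threshold."""
    known = [weights[w] for w in tokenize(message) if w in weights]
    if not known:
        return False  # no known words: treat the message as "not spam"
    score = sum(known) / len(known)  # the letter's "weight" = average word weight
    return score > threshold
```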

Mathematical Foundations

Bayesian mail filters rely on Bayes' theorem, which is applied several times in the spam-filtering context:

  • first, to compute the probability that a message is spam, given that a particular word appears in it;
  • a second time, to compute the probability that the message is spam, taking into account all of its words (or a relevant subset of them);
  • sometimes a third time, to handle messages containing rare words.

Calculating the probability that a message containing a given word is spam

Suppose a suspicious message contains the word "Replica". Most people who regularly receive e-mail know that such a message is likely to be spam, more precisely an offer to sell counterfeit copies of well-known watch brands. The spam-detection program, however, does not "know" such facts; all it can do is compute probabilities.

The formula the software uses to determine this is derived from Bayes' theorem and the law of total probability:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S) \cdot \Pr(S)}{\Pr(W)} = \frac{\Pr(W \mid S) \cdot \Pr(S)}{\Pr(W \mid S) \cdot \Pr(S) + \Pr(W \mid H) \cdot \Pr(H)}$$

Where:

  • $\Pr(S \mid W)$ - the conditional probability that a message is spam, given that it contains the word "Replica";
  • $\Pr(S)$ - the overall (prior) probability that an arbitrary message is spam;
  • $\Pr(W \mid S)$ - the conditional probability that the word "Replica" appears in a message, given that the message is spam;
  • $\Pr(H)$ - the overall (prior) probability that an arbitrary message is not spam (i.e. is "ham");
  • $\Pr(W \mid H)$ - the conditional probability that the word "Replica" appears in a message, given that the message is "ham".

Spam words

Recent statistical studies [7] have shown that today the probability that an arbitrary message is spam is at least 80%: $\Pr(S) = 0.8,\ \Pr(H) = 0.2$.

However, most Bayesian spam-detection programs assume that there is no a priori reason to expect an incoming message to be "spam" rather than "ham", and take both to be equally likely: $\Pr(S) = 0.5,\ \Pr(H) = 0.5$.

Filters that use this hypothesis are referred to as “open-minded” filters. This means that they have no prejudice regarding incoming email. This assumption allows us to simplify the general formula to:

$$\Pr(S \mid W) = \frac{\Pr(W \mid S)}{\Pr(W \mid S) + \Pr(W \mid H)}$$
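For illustration (the frequencies below are invented for the example, not measured): if the word "Replica" occurred in 5% of the training messages marked as spam, so that $\Pr(W \mid S) = 0.05$, and in 0.05% of the messages marked as ham, so that $\Pr(W \mid H) = 0.0005$, then the simplified formula gives

$$\Pr(S \mid W) = \frac{0.05}{0.05 + 0.0005} \approx 0.99,$$

that is, a message containing this word would be judged spam, on the strength of this word alone, with probability of about 99%.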

The value $\Pr(S \mid W)$ is called the spamicity of the word $W$. The number $\Pr(W \mid S)$ used in the formula above is approximately equal to the relative frequency of messages containing the word $W$ among the messages identified as spam during the training phase, that is:

$$\Pr(W_i \mid S) = \frac{\mathrm{count}(M : W_i \in M,\ M \in S)}{\sum_{j} \mathrm{count}(M : W_j \in M,\ M \in S)}$$

Similarly, $\Pr(W \mid H)$ is approximately equal to the relative frequency of messages containing the word $W$ among the messages identified as "ham" during the training phase:

$$\Pr(W_i \mid H) = \frac{\mathrm{count}(M : W_i \in M,\ M \in H)}{\sum_{j} \mathrm{count}(M : W_j \in M,\ M \in H)}$$

For these approximations to be meaningful, the set of training messages must be sufficiently large and representative. It is also desirable that the training set match the 50% prior assumption about the split between spam and ham, i.e. that the "spam" and "ham" message sets be of the same size.
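A small sketch of how such estimates might be computed, following the "fraction of messages containing the word" reading given in the text (it reuses the tokenize function from the earlier sketch and is an illustration only, not code from any of the filters named above):

```python
def spamicity(word, spam_messages, ham_messages):
    """Estimate Pr(S|W) under equal priors from two labelled training sets."""
    # Fraction of spam / ham training messages that contain the word at least once.
    pr_w_given_s = sum(word in tokenize(m) for m in spam_messages) / len(spam_messages)
    pr_w_given_h = sum(word in tokenize(m) for m in ham_messages) / len(ham_messages)
    if pr_w_given_s + pr_w_given_h == 0.0:
        return 0.5  # word never seen in training: treat as neutral (see "The rare word problem")
    return pr_w_given_s / (pr_w_given_s + pr_w_given_h)
```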

Of course, deciding whether a message is spam or ham based on the presence of a single word is error-prone. That is why Bayesian spam filters examine several words and combine their spamicities to estimate the overall probability that the message is spam.

Combining individual probabilities

Spam filters built on the naive Bayes classifier make the "naive" assumption that the events corresponding to the presence of individual words in an email or message are mutually independent. This simplification is generally false for natural languages such as English, where, for example, the probability of encountering an adjective depends on the presence of a noun. Under this "naive" assumption, for the problem of classifying messages into just two classes, $S$ (spam) and $H = \neg S$ ("ham", i.e. not spam), Bayes' theorem yields the following formula for estimating the probability that an entire message containing the words $W_1, W_2, \ldots, W_N$ is spam:

$$p(S \mid W_1, W_2, \ldots, W_N) = \text{[by Bayes' theorem]} = \frac{p(W_1, W_2, \ldots, W_N \mid S) \cdot p(S)}{p(W_1, W_2, \ldots, W_N)} = \text{[since the } W_i \text{ are assumed independent]} =$$
$$= \frac{\prod_i p(W_i \mid S) \cdot p(S)}{p(W_1, W_2, \ldots, W_N)} = \text{[by Bayes' theorem]} = \frac{\prod_i \frac{p(S \mid W_i) \cdot p(W_i)}{p(S)} \cdot p(S)}{p(W_1, W_2, \ldots, W_N)} = \text{[by the law of total probability]} =$$
$$= \frac{\prod_i \frac{p(S \mid W_i) \cdot p(W_i)}{p(S)} \cdot p(S)}{\prod_i p(W_i \mid S) \cdot p(S) + \prod_i p(W_i \mid \neg S) \cdot p(\neg S)} =$$
$$= \frac{\prod_i \bigl(p(S \mid W_i) \cdot p(W_i)\bigr) \cdot p(S)^{1-N}}{\prod_i \bigl(p(S \mid W_i) \cdot p(W_i)\bigr) \cdot p(S)^{1-N} + \prod_i \bigl(p(\neg S \mid W_i) \cdot p(W_i)\bigr) \cdot p(\neg S)^{1-N}} =$$
$$= \frac{\prod_i p(S \mid W_i)}{\prod_i p(S \mid W_i) + \left(\frac{p(\neg S)}{p(S)}\right)^{1-N} \cdot \prod_i p(\neg S \mid W_i)}$$

Thus, assuming $p(S) = p(\neg S) = 0.5$, we obtain:

$$p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}$$

Where:

  • $p = \Pr(S \mid W_1, W_2, \ldots, W_N)$ - the probability that a message containing the words $W_1, W_2, \ldots, W_N$ is spam;
  • $p_1$ - the conditional probability $p(S \mid W_1)$ that a message is spam, given that it contains the first word (for example, "replica");
  • $p_2$ - the conditional probability $p(S \mid W_2)$ that a message is spam, given that it contains the second word (for example, "watches");
  • and so on;
  • $p_N$ - the conditional probability $p(S \mid W_N)$ that a message is spam, given that it contains the N-th word (for example, "home").

(For a derivation, see [8].)

The resulting value of p is usually compared with a chosen threshold (for example, $0.5$) to decide whether the message is spam. If p is below the threshold, the message is considered likely "ham"; otherwise it is considered likely spam.
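A minimal sketch of this combination step, assuming the individual spamicities $p_i$ have already been computed (for example with the spamicity function sketched earlier) and that each lies strictly between 0 and 1; the threshold 0.5 is the example value from the text:

```python
import math

def combine(probabilities, threshold=0.5):
    """Combine per-word spamicities p_1..p_N with the naive Bayes product formula."""
    # Assumes every p is strictly between 0 and 1; work in log space to avoid
    # numerical underflow when many words are combined.
    log_spam = sum(math.log(p) for p in probabilities)
    log_ham = sum(math.log(1.0 - p) for p in probabilities)
    p = 1.0 / (1.0 + math.exp(log_ham - log_spam))  # prod(p) / (prod(p) + prod(1-p))
    return p, p > threshold
```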

The Rare Word Problem

This problem arises when a word was never encountered during the training phase: both the numerator and the denominator are then equal to zero, in the general formula as well as in the spamicity formula.

More generally, words that the program has encountered only a few times during training are not representative (the sample is too small to draw a reliable conclusion about the properties of such a word). A simple solution is to ignore such unreliable words.

Other heuristic enhancements

"Neutral" words such as "the", "a", "some" or "is" (in English), and their equivalents in other languages, can be ignored. More generally, some Bayesian filters simply ignore all words whose spamicity is near 0.5, since this yields a qualitatively better decision. Only words whose spamicity is close to 0.0 (characteristic of legitimate, "ham" messages) or close to 1.0 (characteristic of spam) are taken into account. The selection can be configured, for example, to keep only the ten words of the examined message with the largest absolute value |0.5 - Pr|.
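A sketch of that selection heuristic; the cut-off of ten words and the optional neutral band are the example values from the paragraph above, not fixed parameters of any particular filter:

```python
def select_significant(word_probs, keep=10, neutral_band=0.0):
    """Keep only the spamicities that are farthest from the neutral value 0.5."""
    # Optionally drop near-neutral words first, then keep the `keep` most extreme ones.
    extreme = [p for p in word_probs if abs(p - 0.5) > neutral_band]
    return sorted(extreme, key=lambda p: abs(p - 0.5), reverse=True)[:keep]
```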

Some software products take into account the fact that a certain word appears several times in the message being checked [9] , while others do not.

Some software products use phrase patterns (sequences of words) instead of isolated natural-language words [10]. For example, with a four-word "contextual window" they compute the spamicity of the phrase "Viagra, good for" instead of the spamicities of the individual words "Viagra", "good" and "for". This method is more sensitive to context and suppresses Bayesian noise better, at the cost of a larger database.
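A sketch of extracting such fixed-length word windows; the window length of four matches the example above, and the use of overlapping, sliding windows is one possible reading of the description:

```python
def word_windows(text, n=4):
    """Return the overlapping n-word phrases ("contextual windows") of a message."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
```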

Mixed Methods

Besides the "naive" Bayesian approach, there are other ways of combining the individual probabilities of different words. These methods differ from the "naive" method in the assumptions they make about the statistical properties of the input data; different hypotheses lead to radically different formulas for combining the individual probabilities.

For example, under the assumption that, up to a constant factor, the logarithm of the product of the individual probabilities follows a chi-squared distribution with 2N degrees of freedom, one can use the formula:

$$p = C^{-1}\left(-2 \ln(p_1 p_2 \cdots p_N),\ 2N\right)$$

where $C^{-1}$ denotes the inverse of the chi-squared distribution function.
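A sketch of this chi-squared combination; here the "inverse chi-squared function" $C^{-1}$ is interpreted as the upper-tail probability (survival function) of the chi-squared distribution, which is one common reading of this formula, and SciPy is assumed to be available:

```python
import math
from scipy.stats import chi2

def chi_squared_combine(probabilities):
    """Combine per-word spamicities p_1..p_N with the chi-squared formula."""
    n = len(probabilities)
    statistic = -2.0 * sum(math.log(p) for p in probabilities)
    # Upper-tail probability of a chi-squared variable with 2N degrees of freedom.
    return chi2.sf(statistic, 2 * n)
```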

Individual probabilities can also be combined using Markov discrimination methods.

Characteristics

The method is simple (its algorithms are elementary), convenient (it makes "blacklists" and similar artificial tricks unnecessary) and effective (after training on a sufficiently large sample it filters out up to 95-97% of spam), and after any error it can be retrained. Everything, in short, points toward its widespread use, which is indeed the case in practice: almost all modern spam filters are built on it.

The method nevertheless has a fundamental flaw: it rests on the assumption that some words occur more often in spam and others in ordinary mail, and it is ineffective when this assumption does not hold. In practice, however, such spam cannot be recognized even by a human "at a glance", but only by reading the message and understanding its meaning. There is also a Bayesian poisoning technique, in which a large amount of extra text, sometimes carefully chosen, is added in order to "trick" the filter.

Another, non-fundamental drawback, related to the implementation, is that the method works only with text. Aware of this limitation, spammers began placing the advertising content in an image, while the text of the letter is either absent or meaningless. To counter this, one must use either text-recognition tools (an "expensive" procedure applied only when absolutely necessary) or older filtering methods such as blacklists and regular expressions (since such letters often have a stereotyped form).

See also

  • Bayesian programming

Notes

  1. Jason Rennie. ifile (1996). Archived October 25, 2012.
  2. Sahami, Dumais, Heckerman, Horvitz, 1998.
  3. Paul Graham (2003), Better Bayesian filtering.
  4. Brian Livingston (2002), Paul Graham provides stunning answer to spam e-mails.
  5. Guzella, Caminhas, 2009.
  6. Junk Mail Controls. MozillaZine (November 2009). Archived October 25, 2012.
  7. More Than 90 Percent of E-Mails in Third Quarter (of 2008) Were Spam, Certification Magazine.
  8. Combining probabilities. MathPages. Archived April 16, 2012.
  9. Brian Burton. SpamProbe - Bayesian Spam Filtering Tweaks (2003). Archived April 16, 2012.
  10. Jonathan A. Zdziarski. Bayesian Noise Reduction: Contextual Symmetry Logic Utilizing Pattern Consistency Analysis (2004). (dead link)

Literature

  • Guzella T. S., Caminhas W. M. A review of machine learning approaches to Spam filtering // Expert Systems with Applications. - 2009. - Vol. 36, no. 7. - P. 10206-10222. - DOI: 10.1016/j.eswa.2009.02.037.
  • Metsis V., Androutsopoulos I., Paliouras G. Spam Filtering with Naive Bayes - Which Naive Bayes? // CEAS 2006: Third Conference on Email and Anti-Spam, July 27-28, 2006, Mountain View, California, USA. - 2006.
  • Sahami M., Dumais S., Heckerman D., Horvitz E. A Bayesian approach to filtering junk email // AAAI Workshop on Learning for Text Categorization. Technical Report. - 1998.

Links

  • Paul Graham. A plan for spam // Paul Graham's personal site.

