
Document Classification

Document classification is one of the tasks of information retrieval; it consists in assigning a document to one of several categories on the basis of its content.

Classification can be carried out entirely by hand, automatically using a manually created set of rules, or automatically using machine learning methods.

Text classification should be distinguished from clustering: in the latter case the texts are also grouped according to some criteria, but there are no predefined categories.


Text classification approaches

There are three approaches to the task of classifying texts.[1]

First, classification is not always carried out by a computer. For example, in an ordinary library, subject headings are assigned to books manually by a librarian. Such manual classification is expensive and not applicable where a large number of documents must be classified at high speed.

Another approach is to write rules by which a text can be assigned to one category or another. For example, one such rule might be: "if the text contains the words derivative and equation, then classify it as mathematics." A specialist who is familiar with the subject area and skilled at writing regular expressions can draw up a set of rules that are then automatically applied to incoming documents to classify them. This approach is better than the previous one, since the classification process is automated and the number of processed documents is therefore practically unlimited. Moreover, manually built rules can provide better classification accuracy than machine learning (see below). However, creating the rules and keeping them up to date (for example, if the name of the current president of a country is used to classify news, the corresponding rule has to be changed from time to time) requires constant effort from a specialist.

Finally, the third approach is based on machine learning. In this approach, a set of rules or, more generally, a decision criterion of the text classifier is computed automatically from training data (in other words, the classifier is trained). The training data consist of a number of good sample documents from each class. Machine learning still requires manual markup (the term markup refers to the process of assigning a class to a document), but markup is a simpler task than writing rules. Moreover, markup can be performed during normal use of the system. For example, an e-mail program may allow users to mark messages as spam, thereby forming a training set for a classifier that filters unwanted messages. Thus, text classification based on machine learning is an example of supervised learning, in which a person acts as the "teacher" by defining the set of classes and marking up the training set.
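As a minimal sketch of this idea for the spam-filtering example above, a classifier can be trained from user-labelled messages with a library such as scikit-learn. The messages, labels, and class names below are invented for illustration only:

```python
# A minimal sketch of supervised text classification (spam filtering).
# The tiny corpus is invented; scikit-learn is one of many possible tools.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training set: messages that a user has already marked up as spam / not spam.
messages = [
    "Win a free prize now",           # marked as spam
    "Meeting rescheduled to Monday",  # marked as not spam
    "Cheap pills, limited offer",     # spam
    "Project report attached",        # not spam
]
labels = ["spam", "ham", "spam", "ham"]

# Index the documents as word-count vectors and train the classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
classifier = MultinomialNB().fit(X, labels)

# New incoming mail is then classified automatically.
print(classifier.predict(vectorizer.transform(["Free offer just for you"])))
```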

Statement of the Problem

There is a set of categories (classes, labels) $\mathfrak{C} = \{c_1, \ldots, c_{|\mathfrak{C}|}\}$.

There is a set of documents $\mathfrak{D} = \{d_1, \ldots, d_{|\mathfrak{D}|}\}$.

The target function $\Phi \colon \mathfrak{C} \times \mathfrak{D} \rightarrow \{0, 1\}$ is unknown.

It is necessary to build a classifier $\Phi'$ that is as close as possible to $\Phi$.

There is an initial collection of labeled documents $\mathfrak{R} \subset \mathfrak{C} \times \mathfrak{D}$ for which the values of $\Phi$ are known. It is usually divided into a "training" part and a "test" part: the first is used to train the classifier, the second for an independent check of the quality of its work.

The classifier can give an exact answer, $\Phi' \colon \mathfrak{C} \times \mathfrak{D} \rightarrow \{0, 1\}$, or a degree of similarity, $\Phi' \colon \mathfrak{C} \times \mathfrak{D} \rightarrow [0, 1]$.
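As a sketch, the difference between the two kinds of output can be expressed as Python type signatures (the names HardClassifier and SoftClassifier are illustrative, not from the article):

```python
from typing import Callable

# Exact answer: Phi': C x D -> {0, 1};
# takes (category, document) and returns 0 or 1.
HardClassifier = Callable[[str, str], int]

# Degree of similarity: Phi': C x D -> [0, 1];
# takes (category, document) and returns a value between 0 and 1.
SoftClassifier = Callable[[str, str], float]
```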

Processing Steps

Document indexing
Construction of a numerical model of the text, for example as a multidimensional vector of words and their weights in the document. The dimensionality of the model may also be reduced at this stage.
Building and training the classifier
Various machine learning methods can be used: decision trees, a naive Bayes classifier, neural networks, support vector machines, and so on.
Assessing classification quality
Quality can be evaluated by criteria such as recall and precision, and classifiers can be compared on special test collections (a sketch of all three steps follows this list).
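A minimal sketch of these three steps, using scikit-learn and an invented toy corpus (the documents, labels, and class names are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Labeled collection, already divided into "training" and "test" parts.
train_docs = ["the integral of the equation", "derivative and limit",
              "bacteria under the microscope", "cell membrane structure"]
train_labels = ["math", "math", "biology", "biology"]
test_docs = ["solve the equation", "the bacteria cell"]
test_labels = ["math", "biology"]

# 1. Document indexing: weighted word vectors (TF-IDF in this sketch).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# 2. Building and training the classifier (a naive Bayes classifier here).
classifier = MultinomialNB().fit(X_train, train_labels)

# 3. Quality assessment on the test part: precision and recall per class.
predictions = classifier.predict(X_test)
print(classification_report(test_labels, predictions))
```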

Learning Methods

Naive Bayesian Model

The naive Bayesian model is a probabilistic learning method. The probability that a document $d$ belongs to class $c$ is written as $P(c \mid d)$. Since the purpose of classification is to find the most suitable class for a given document, in naive Bayesian classification the task is to find the most probable class $c_m$:

$c_m = \underset{c \in C}{\operatorname{argmax}} \, P(c \mid d)$

This probability cannot be computed directly, because doing so would require the training set to contain all (or almost all) possible combinations of classes and documents. However, using Bayes' formula, the expression for $P(c \mid d)$ can be rewritten as

$c_m = \underset{c \in C}{\operatorname{argmax}} \, \frac{P(d \mid c)\,P(c)}{P(d)} = \underset{c \in C}{\operatorname{argmax}} \, P(d \mid c)\,P(c)$

where the denominator $P(d)$ is omitted because it does not depend on $c$ and therefore does not affect the maximization; $P(c)$ is the probability of encountering class $c$ regardless of the document under consideration; and $P(d \mid c)$ is the probability of encountering document $d$ among the documents of class $c$.

Using the training set, the probability P (c) can be estimated as

$\hat{P}(c) = \frac{N_c}{N}$

where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents in the training set. A different symbol is used for the probability here because the training set allows the probability only to be estimated, not determined exactly.
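For example, the estimate $\hat{P}(c) = N_c / N$ amounts to simple counting; a minimal sketch in Python (the labels below are invented):

```python
from collections import Counter

# Invented training labels: one class label per training document.
labels = ["math", "math", "math", "biology", "biology"]

counts = Counter(labels)                        # N_c for each class c
total = len(labels)                             # N
priors = {c: n / total for c, n in counts.items()}
print(priors)                                   # {'math': 0.6, 'biology': 0.4}
```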

To estimate $P(d \mid c) = P(t_1, t_2, \ldots, t_{n_d} \mid c)$, where $t_k$ is a term from document $d$ and $n_d$ is the total number of terms in the document (including repetitions), it is necessary to introduce simplifying assumptions: (1) the conditional independence of terms and (2) the independence of term positions. In other words, we neglect, first, the fact that in natural-language text the appearance of one word is often closely related to the appearance of other words (for example, the word integral is more likely to occur in the same text as the word equation than as the word bacterium) and, second, the fact that the probability of encountering the same word differs between positions in the text. It is precisely because of these gross simplifications that this model of natural language is called naive (nevertheless, it is quite effective for the classification task). In light of these assumptions, using the rule for multiplying the probabilities of independent events, we can write

$P(d \mid c) = P(t_1, t_2, \ldots, t_{n_d} \mid c) = P(t_1 \mid c)\,P(t_2 \mid c) \cdots P(t_{n_d} \mid c) = \prod_{k=1}^{n_d} P(t_k \mid c)$

The estimate of the probability $P(t \mid c)$ obtained from the training set is

$\hat{P}(t \mid c) = \frac{T_{ct}}{T_c}$

where $T_{ct}$ is the number of occurrences of term $t$ in all documents of class $c$, at any position (the second simplifying assumption is essentially used here; otherwise these probabilities would have to be computed separately for each position in the document, which cannot be done accurately enough because of the sparseness of the training data: one can hardly expect every term to occur in every position a sufficient number of times), and $T_c$ is the total number of terms in the documents of class $c$. All repeated occurrences are counted.

After the classifier has been "trained", that is, after the values $\hat{P}(c)$ and $\hat{P}(t \mid c)$ have been found, the class of a document can be determined as

$c_m = \underset{c \in C}{\operatorname{argmax}} \, \hat{P}(d \mid c)\,\hat{P}(c) = \underset{c \in C}{\operatorname{argmax}} \, \hat{P}(c) \prod_{k=1}^{n_d} \hat{P}(t_k \mid c)$

To avoid numerical underflow in the last formula caused by the large number of factors, in practice the sum of logarithms is usually used instead of the product. Taking logarithms does not affect the maximization, since the logarithm is a monotonically increasing function. Therefore, most implementations use the following formula instead:

$c_m = \underset{c \in C}{\operatorname{argmax}} \left[ \log \hat{P}(c) + \sum_{k=1}^{n_d} \log \hat{P}(t_k \mid c) \right]$

This formula has a simple interpretation. A document has a better chance of being assigned to a frequent class, and the term $\log \hat{P}(c)$ makes the corresponding contribution to the total sum. The values $\log \hat{P}(t \mid c)$ are larger the more important term $t$ is for identifying class $c$, and accordingly the larger their contribution to the total.
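Putting the estimates $\hat{P}(c)$ and $\hat{P}(t \mid c)$ together with the log-sum decision rule, a from-scratch sketch might look like the following. The toy corpus is invented, and add-one (Laplace) smoothing, which the formulas above omit, is included only so that unseen terms do not produce a logarithm of zero:

```python
import math
from collections import Counter, defaultdict

# Invented training collection: (document terms, class) pairs.
training = [
    (["integral", "of", "the", "equation"], "math"),
    (["derivative", "and", "equation"], "math"),
    (["bacteria", "in", "the", "cell"], "biology"),
]

# "Training": count N_c (documents per class) and T_ct (term occurrences per class).
class_docs = Counter(c for _, c in training)
term_counts = defaultdict(Counter)
for terms, c in training:
    term_counts[c].update(terms)
vocabulary = {t for terms, _ in training for t in terms}

def classify(document_terms):
    """Return argmax over c of log P^(c) + sum_k log P^(t_k | c)."""
    best_class, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(training))    # log P^(c) = log(N_c / N)
        t_c = sum(term_counts[c].values())                  # T_c: total terms in class c
        for t in document_terms:
            # Add-one smoothing (not in the formulas above) avoids log(0)
            # for terms that never occur in class c.
            score += math.log((term_counts[c][t] + 1) / (t_c + len(vocabulary)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["the", "equation", "and", "integral"]))     # expected: "math"
```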

Applications

  • spam filtering
  • compiling online catalogs
  • selection of contextual advertising
  • document workflow systems
  • automatic summarization (annotation)
  • word sense disambiguation in machine translation
  • restricting the scope of search in search engines
  • determining the encoding and language of a text

Notes

  1. Manning et al. (2009), p. 255.

Literature

  • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Online edition. Cambridge University Press, 2009. 544 p.

See also

  • Naive Bayes Classifier
  • Clustering
  • Document clustering
  • Semantic mirror

Links

  • Lecture No. 6 on text classification from the course "Modern Problems of Theoretical Informatics" (problem statement, building and training the classifier, quality assessment).
  • F. Sebastiani. Machine Learning in Automated Text Categorization (PDF). (in English)
  • "Text mining. Text classification." An example of document classification using the algorithms of the STATISTICA software package.
Source: https://ru.wikipedia.org/w/index.php?title=Document Classification&oldid=101398086

