
Document Classification

Document classification is one of the tasks of information retrieval; it consists in assigning a document to one of several categories on the basis of its content.

Classification can be carried out entirely by hand, automatically using a manually created set of rules, or automatically using machine learning methods.

Text classification should be distinguished from clustering: in the latter case the texts are also grouped according to some criteria, but there are no predefined categories.


Text classification approaches

There are three approaches to the task of classifying texts.[1]

First, classification is not always carried out by a computer. For example, in an ordinary library, subject headings are assigned to books manually by a librarian. Such manual classification is expensive and not applicable where a large number of documents must be classified at high speed.

Another approach is to write rules by which a text can be assigned to one category or another. For example, one such rule might be: "if the text contains the words derivative and equation, then classify it as mathematics." A specialist who is familiar with the subject area and skilled at writing regular expressions can draw up a set of rules that are then automatically applied to incoming documents to classify them. This approach is better than the previous one, since the classification process is automated and the number of processed documents is therefore practically unlimited. Moreover, manually built rules can provide better classification accuracy than machine learning (see below). However, creating the rules and keeping them up to date (for example, if the name of the current president of a country is used to classify news, the corresponding rule has to be changed from time to time) requires constant effort from a specialist.

Finally, the third approach is based on machine learning. In this approach, a set of rules or, more generally, a decision criterion of the text classifier is computed automatically from training data (in other words, the classifier is trained). The training data consist of a number of good sample documents from each class. Machine learning still requires manual markup (the term markup refers to the process of assigning a class to a document), but markup is a simpler task than writing rules. Moreover, markup can be performed during normal use of the system. For example, an e-mail program may allow users to mark messages as spam, thereby forming a training set for a classifier that filters unwanted messages. Thus, text classification based on machine learning is an example of supervised learning, in which a person acts as the "teacher" by defining the set of classes and marking up the training set.
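As a minimal sketch of this idea for the spam-filtering example above, a classifier can be trained from user-labelled messages with a library such as scikit-learn. The messages, labels, and class names below are invented for illustration only:

```python
# A minimal sketch of supervised text classification (spam filtering).
# The tiny corpus is invented; scikit-learn is one of many possible tools.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Training set: messages that a user has already marked up as spam / not spam.
messages = [
    "Win a free prize now",           # marked as spam
    "Meeting rescheduled to Monday",  # marked as not spam
    "Cheap pills, limited offer",     # spam
    "Project report attached",        # not spam
]
labels = ["spam", "ham", "spam", "ham"]

# Index the documents as word-count vectors and train the classifier.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)
classifier = MultinomialNB().fit(X, labels)

# New incoming mail is then classified automatically.
print(classifier.predict(vectorizer.transform(["Free offer just for you"])))
```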

Statement of the Problem

There is a set of categories (classes, labels) $\mathfrak{C} = \{c_1, \ldots, c_{|\mathfrak{C}|}\}$.

There is a set of documents $\mathfrak{D} = \{d_1, \ldots, d_{|\mathfrak{D}|}\}$.

The target function $\Phi \colon \mathfrak{C} \times \mathfrak{D} \rightarrow \{0, 1\}$ is unknown.

It is necessary to build a classifier $\Phi'$ that is as close as possible to $\Phi$.

There is an initial collection of labeled documents $\mathfrak{R} \subset \mathfrak{C} \times \mathfrak{D}$ for which the values of $\Phi$ are known. It is usually divided into a "training" part and a "test" part: the first is used to train the classifier, the second for an independent check of the quality of its work.

The classifier can give an exact answer, $\Phi' \colon \mathfrak{C} \times \mathfrak{D} \rightarrow \{0, 1\}$, or a degree of similarity, $\Phi' \colon \mathfrak{C} \times \mathfrak{D} \rightarrow [0, 1]$.
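As a sketch, the difference between the two kinds of output can be expressed as Python type signatures (the names HardClassifier and SoftClassifier are illustrative, not from the article):

```python
from typing import Callable

# Exact answer: Phi': C x D -> {0, 1};
# takes (category, document) and returns 0 or 1.
HardClassifier = Callable[[str, str], int]

# Degree of similarity: Phi': C x D -> [0, 1];
# takes (category, document) and returns a value between 0 and 1.
SoftClassifier = Callable[[str, str], float]
```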

Processing Steps

Document indexing
Construction of a numerical model of the text, for example as a multidimensional vector of words and their weights in the document. The dimensionality of the model may also be reduced at this stage.
Building and training the classifier
Various machine learning methods can be used: decision trees, a naive Bayes classifier, neural networks, support vector machines, and so on.
Assessing classification quality
Quality can be evaluated by criteria such as recall and precision, and classifiers can be compared on special test collections (a sketch of all three steps follows this list).
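A minimal sketch of these three steps, using scikit-learn and an invented toy corpus (the documents, labels, and class names are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Labeled collection, already divided into "training" and "test" parts.
train_docs = ["the integral of the equation", "derivative and limit",
              "bacteria under the microscope", "cell membrane structure"]
train_labels = ["math", "math", "biology", "biology"]
test_docs = ["solve the equation", "the bacteria cell"]
test_labels = ["math", "biology"]

# 1. Document indexing: weighted word vectors (TF-IDF in this sketch).
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

# 2. Building and training the classifier (a naive Bayes classifier here).
classifier = MultinomialNB().fit(X_train, train_labels)

# 3. Quality assessment on the test part: precision and recall per class.
predictions = classifier.predict(X_test)
print(classification_report(test_labels, predictions))
```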

Learning Methods

Naive Bayesian Model

The naive Bayesian model is a probabilistic learning method. The probability that a document $d$ belongs to class $c$ is written as $P(c \mid d)$. Since the purpose of classification is to find the most suitable class for a given document, in naive Bayesian classification the task is to find the most probable class $c_m$:

$c_m = \underset{c \in C}{\operatorname{argmax}} \, P(c \mid d)$

This probability cannot be computed directly, because doing so would require the training set to contain all (or almost all) possible combinations of classes and documents. However, using Bayes' formula, the expression for $P(c \mid d)$ can be rewritten as

$c_m = \underset{c \in C}{\operatorname{argmax}} \, \frac{P(d \mid c)\,P(c)}{P(d)} = \underset{c \in C}{\operatorname{argmax}} \, P(d \mid c)\,P(c)$

where the denominator $P(d)$ is omitted because it does not depend on $c$ and therefore does not affect the maximization; $P(c)$ is the probability of encountering class $c$ regardless of the document under consideration; and $P(d \mid c)$ is the probability of encountering document $d$ among the documents of class $c$.

Using the training set, the probability P (c) can be estimated as

$\hat{P}(c) = \frac{N_c}{N}$

where $N_c$ is the number of documents in class $c$ and $N$ is the total number of documents in the training set. A different symbol is used for the probability here because the training set allows the probability only to be estimated, not determined exactly.
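For example, the estimate $\hat{P}(c) = N_c / N$ amounts to simple counting; a minimal sketch in Python (the labels below are invented):

```python
from collections import Counter

# Invented training labels: one class label per training document.
labels = ["math", "math", "math", "biology", "biology"]

counts = Counter(labels)                        # N_c for each class c
total = len(labels)                             # N
priors = {c: n / total for c, n in counts.items()}
print(priors)                                   # {'math': 0.6, 'biology': 0.4}
```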

To estimate $P(d \mid c) = P(t_1, t_2, \ldots, t_{n_d} \mid c)$, where $t_k$ is a term from document $d$ and $n_d$ is the total number of terms in the document (including repetitions), it is necessary to introduce simplifying assumptions: (1) the conditional independence of terms and (2) the independence of term positions. In other words, we neglect, first, the fact that in natural-language text the appearance of one word is often closely related to the appearance of other words (for example, the word integral is more likely to occur in the same text as the word equation than as the word bacterium) and, second, the fact that the probability of encountering the same word differs between positions in the text. It is precisely because of these gross simplifications that this model of natural language is called naive (nevertheless, it is quite effective for the classification task). In light of these assumptions, using the rule for multiplying the probabilities of independent events, we can write

$P(d \mid c) = P(t_1, t_2, \ldots, t_{n_d} \mid c) = P(t_1 \mid c)\,P(t_2 \mid c) \cdots P(t_{n_d} \mid c) = \prod_{k=1}^{n_d} P(t_k \mid c)$

The estimate of the probability $P(t \mid c)$ obtained from the training set is

$\hat{P}(t \mid c) = \frac{T_{ct}}{T_c}$

where $T_{ct}$ is the number of occurrences of term $t$ in all documents of class $c$, at any position (the second simplifying assumption is essentially used here; otherwise these probabilities would have to be computed separately for each position in the document, which cannot be done accurately enough because of the sparseness of the training data: one can hardly expect every term to occur in every position a sufficient number of times), and $T_c$ is the total number of terms in the documents of class $c$. All repeated occurrences are counted.

After the classifier has been "trained", that is, after the values $\hat{P}(c)$ and $\hat{P}(t \mid c)$ have been found, the class of a document can be determined as

$c_m = \underset{c \in C}{\operatorname{argmax}} \, \hat{P}(d \mid c)\,\hat{P}(c) = \underset{c \in C}{\operatorname{argmax}} \, \hat{P}(c) \prod_{k=1}^{n_d} \hat{P}(t_k \mid c)$

To avoid numerical underflow in the last formula caused by the large number of factors, in practice the sum of logarithms is usually used instead of the product. Taking logarithms does not affect the maximization, since the logarithm is a monotonically increasing function. Therefore, most implementations use the following formula instead:

$c_m = \underset{c \in C}{\operatorname{argmax}} \left[ \log \hat{P}(c) + \sum_{k=1}^{n_d} \log \hat{P}(t_k \mid c) \right]$

This formula has a simple interpretation. A document has a better chance of being assigned to a frequent class, and the term $\log \hat{P}(c)$ makes the corresponding contribution to the total sum. The values $\log \hat{P}(t \mid c)$ are larger the more important term $t$ is for identifying class $c$, and accordingly the larger their contribution to the total.
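Putting the estimates $\hat{P}(c)$ and $\hat{P}(t \mid c)$ together with the log-sum decision rule, a from-scratch sketch might look like the following. The toy corpus is invented, and add-one (Laplace) smoothing, which the formulas above omit, is included only so that unseen terms do not produce a logarithm of zero:

```python
import math
from collections import Counter, defaultdict

# Invented training collection: (document terms, class) pairs.
training = [
    (["integral", "of", "the", "equation"], "math"),
    (["derivative", "and", "equation"], "math"),
    (["bacteria", "in", "the", "cell"], "biology"),
]

# "Training": count N_c (documents per class) and T_ct (term occurrences per class).
class_docs = Counter(c for _, c in training)
term_counts = defaultdict(Counter)
for terms, c in training:
    term_counts[c].update(terms)
vocabulary = {t for terms, _ in training for t in terms}

def classify(document_terms):
    """Return argmax over c of log P^(c) + sum_k log P^(t_k | c)."""
    best_class, best_score = None, float("-inf")
    for c in class_docs:
        score = math.log(class_docs[c] / len(training))    # log P^(c) = log(N_c / N)
        t_c = sum(term_counts[c].values())                  # T_c: total terms in class c
        for t in document_terms:
            # Add-one smoothing (not in the formulas above) avoids log(0)
            # for terms that never occur in class c.
            score += math.log((term_counts[c][t] + 1) / (t_c + len(vocabulary)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["the", "equation", "and", "integral"]))     # expected: "math"
```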

Applications

  • spam filtering
  • compiling online catalogs
  • selection of contextual advertising
  • document workflow systems
  • automatic summarization (annotation)
  • word sense disambiguation in machine translation
  • restricting the scope of search in search engines
  • determining the encoding and language of a text

Notes

  1. Manning et al. (2009), p. 255.

Literature

  • Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval. Online edition. Cambridge University Press, 2009. 544 p.

See also

  • Naive Bayes Classifier
  • Clustering
  • Document clustering
  • Semantic mirror

Links

  • Lecture No. 6 on text classification from the course "Modern Problems of Theoretical Informatics" (problem statement, building and training the classifier, quality assessment).
  • F. Sebastiani. Machine Learning in Automated Text Categorization (PDF). (in English)
  • "Text mining. Text classification." An example of document classification using the algorithms of the STATISTICA software package.
Source: https://ru.wikipedia.org/w/index.php?title=Document Classification&oldid=101398086

