Text mining (intellectual analysis of texts, IAT) is a field of artificial intelligence whose goal is to extract information from collections of text documents using practically effective methods of machine learning and natural language processing. The name "intellectual analysis of texts" echoes the notion of "data mining" (intellectual analysis of data, IAD), which reflects the similarity of their goals, approaches to information processing, and applications. The difference shows only in the final methods and in the fact that IAD deals with data warehouses and databases rather than with electronic libraries and text corpora.
IAT Task Groups
The key groups of IAT tasks are text categorization, information extraction and information retrieval, processing of changes in text collections, and the development of means of presenting information to the user. [1]
Document categorization assigns each document in a collection to one or more groups (classes, clusters) of similar texts, for example by topic or style. Categorization can be carried out with or without human involvement. In the first case, called document classification, the IAT system must assign texts to classes that the user has defined in advance (in whatever way suits the user). In machine learning terms, this is supervised learning: the user must give the IAT system both the set of classes and sample documents belonging to each class.
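The supervised setting can be illustrated with a minimal sketch. It assumes a multinomial naive Bayes model with add-one smoothing, which is only one of many possible classifiers and is not named in the source; the function names, class labels, and sample texts are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for the demo.
    return text.lower().split()


def train(labeled_docs):
    """Collect per-class document and word counts from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab


def classify(text, model):
    """Return the class with the highest log-probability for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# The user supplies both the classes and sample documents for each class.
samples = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("parliament passed the new budget law", "politics"),
    ("the minister announced an election date", "politics"),
]
model = train(samples)
print(classify("a goal in the final match", model))  # prints: sport
```

The labeled samples play the role of the "teacher": without them the system has no definition of the classes it is asked to assign.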
The second case of categorization is called document clustering. Here the IAT system must itself determine the set of clusters among which the texts can be distributed; in machine learning, the corresponding task is called unsupervised learning. In this case, the user must tell the IAT system how many clusters the collection should be split into (feature selection is assumed to be built into the program's algorithm).
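A minimal sketch of the unsupervised setting follows. It assumes k-means over bag-of-words vectors compared by cosine similarity, with the first k documents used as initial centroids to keep the run deterministic; the source does not prescribe a particular clustering algorithm, and all names and sample texts here are hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    # Assumption: a raw word-count vector stands in for feature selection.
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Average a non-empty list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return Counter({w: c / len(vectors) for w, c in total.items()})


def kmeans(texts, k, max_steps=10):
    """Return a cluster index (0..k-1) for each text."""
    vectors = [vectorize(t) for t in texts]
    centroids = vectors[:k]  # deterministic seeding: first k documents
    assignment = None
    for _step in range(max_steps):
        # Assign each document to its most similar centroid.
        new_assignment = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_assignment == assignment:
            break  # converged: no document changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = mean_vector(members)
    return assignment


texts = [
    "football match goal team",
    "election vote parliament law",
    "team scored goal in match",
    "parliament vote on budget law",
]
labels = kmeans(texts, k=2)  # the user supplies only the number of clusters
print(labels)  # prints: [0, 1, 0, 1]
```

Unlike the classification case, no labels are given: the only input besides the texts is the number of clusters `k`, and the cluster indices carry no meaning beyond grouping similar documents together.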
Applications
Text analysis has been attracting growing attention in areas such as security, commerce, and science.
Security
Many text analysis packages, such as Aerotext and Attensity, target the market for security applications, in particular the analysis of plain-text sources such as news sites.
Software
Research and development units of large companies, such as IBM, Apple, and Microsoft, are exploring text analysis technology to further automate analysis and data extraction.
Notes
1. Berry, 2003, p. xi.
Literature
In Russian:
- Peskova O. V. Algorithms for classification of full-text documents // Automatic processing of texts in natural language and computational linguistics. - Moscow: MIEM (Moscow State Institute of Electronics and Mathematics), 2011. - P. 170-212. - ISBN 978-5-94506-294-8.
In English:
- Survey of Text Mining I: Clustering, Classification, and Retrieval / Ed. by M. W. Berry. - 2004. - Springer, 2003. - 261 p. - ISBN 0387955631.
- Aggarwal C. C., Zhai C. Mining Text Data. - Springer, 2012. - 527 p. - ISBN 9781461432234.
- Do Prado H. A. Emerging Technologies for Text Mining: Techniques and Applications / Ed. by H. A. Do Prado, E. Ferneda. - Idea Group Reference, 2007. - 358 p. - ISBN 1599043734.