Clever Geek Handbook

Text Corpus

In linguistics, a corpus (plural: corpora; in Russian, the plural in this sense is «корпусы», not «корпуса» [1]) is a collection of texts selected and processed according to defined rules and used as a basis for studying a language. Corpora are used for statistical analysis, for testing statistical hypotheses, and for confirming linguistic rules in a given language. Text corpora are the object of study of corpus linguistics.

Main properties of a corpus

Among the many definitions of a corpus, the following main properties can be distinguished:

  • electronic - in the modern sense, a corpus must exist in electronic form
  • representative - it should adequately "represent" the object it models
  • annotated - markup is the main difference between a corpus and a mere collection of texts
  • pragmatically oriented - it should be created for a specific task

Corpus classification

Corpora can be classified by various criteria: the purpose for which the corpus was created, the type of language data, "literariness", genre, dynamism, type of markup, volume of texts, and so on. By the criterion of parallelism, for example, corpora can be divided into monolingual, bilingual and multilingual. Bilingual and multilingual corpora are of two types:

  1. parallel - a set of texts together with their translations into one or several languages
  2. comparable (pseudo-parallel) - original texts in two or more languages
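
The distinction between parallel and comparable corpora can be illustrated with a small data model. This is only a sketch: the class names (`Text`, `ParallelCorpus`, `ComparableCorpus`) and the `kind` helper are hypothetical and do not come from any particular corpus toolkit.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Text:
    """One document in a corpus (hypothetical minimal schema)."""
    text_id: str
    language: str   # language code such as "ru" or "en"
    content: str


@dataclass
class ParallelCorpus:
    """Original texts stored together with their translations."""
    pairs: list[tuple[Text, list[Text]]] = field(default_factory=list)

    def languages(self) -> set[str]:
        langs = set()
        for original, translations in self.pairs:
            langs.add(original.language)
            langs.update(t.language for t in translations)
        return langs


@dataclass
class ComparableCorpus:
    """Independent original texts in one or more languages, not aligned."""
    texts: list[Text] = field(default_factory=list)

    def languages(self) -> set[str]:
        return {t.language for t in self.texts}


def kind(corpus) -> str:
    """Classify a corpus by the number of languages it contains."""
    n = len(corpus.languages())
    if n == 0:
        return "empty"
    return {1: "monolingual", 2: "bilingual"}.get(n, "multilingual")
```

For example, a `ParallelCorpus` holding one Russian text and its English translation would be reported as bilingual by `kind`.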

Corpus markup

Markup consists of assigning special tags to texts and their components: linguistic tags and external (extralinguistic) tags. Linguistic markup may be morphological, semantic, syntactic, anaphoric, prosodic, discourse-level, and so on. Some corpora are annotated at further structural levels. In particular, some small corpora are fully syntactically annotated. Such corpora are usually called deeply annotated or syntactic corpora, and the syntactic structure itself is a dependency tree.

Manual markup (annotation) of texts is expensive and time-consuming. Various corpus annotation tools are now openly available [2]. They can be broadly divided into stand-alone and web-based tools, and in recent years developers have shifted their focus towards web applications. Web-based systems have several advantages:

  • several people can mark up the same document simultaneously
  • no additional software is required apart from a browser
  • access rights can be configured flexibly
  • the current progress of the markup process is displayed
  • the annotated corpus can be modified
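
To make the idea of morphological and syntactic markup concrete, here is a minimal sketch of one annotated sentence whose syntactic structure is a dependency tree. The `Token` schema and the helper functions are assumptions made for illustration, and the tag labels (`NOUN`, `nsubj`, and so on) are illustrative values rather than the tagset of any specific corpus.

```python
from dataclasses import dataclass


@dataclass
class Token:
    index: int    # 1-based position in the sentence
    form: str     # word as it appears in the text
    lemma: str    # dictionary form
    pos: str      # morphological markup: part-of-speech tag
    head: int     # syntactic markup: index of the governing token, 0 = root
    deprel: str   # label of the dependency relation


# One hand-annotated sentence (illustrative data).
sentence = [
    Token(1, "Corpora", "corpus", "NOUN", 2, "nsubj"),
    Token(2, "contain", "contain", "VERB", 0, "root"),
    Token(3, "texts", "text", "NOUN", 2, "obj"),
]


def is_tree(tokens):
    """Return True if there is exactly one root and no cycles."""
    heads = {t.index: t.head for t in tokens}
    if sum(1 for h in heads.values() if h == 0) != 1:
        return False
    for start in heads:
        seen = set()
        node = start
        while node != 0:
            if node in seen or node not in heads:
                return False
            seen.add(node)
            node = heads[node]
    return True


def children(tokens, head_index):
    """Word forms directly governed by the token at head_index."""
    return [t.form for t in tokens if t.head == head_index]
```

A check such as `is_tree` is the kind of validation an annotation tool might run before accepting a manually annotated sentence.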

The Internet as a corpus

Modern technologies make it possible to create "web corpora", that is, corpora obtained by processing Internet sources:

A web corpus is a special kind of linguistic corpus created by gradually downloading texts from the Internet using automated procedures. These procedures detect the language and encoding of individual web pages on the fly, remove templates, navigation elements, links and advertisements (so-called boilerplate), convert the pages to plain text, and filter, normalize and deduplicate the resulting documents. The documents can then be processed with traditional corpus linguistics tools (tokenization, morphosyntactic and syntactic annotation) and loaded into a corpus search system. Creating a web corpus is not only much cheaper; above all, its size can be an order of magnitude larger than that of traditional corpora [3].

- Vladimír Benko, ARANEA - FAMILY OF BILLION WEB CORPORA
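
The processing steps named in the quotation can be sketched with the Python standard library alone. This is a simplified illustration, not the Aranea pipeline. The list of skipped tags is an assumption, boilerplate removal is reduced to dropping a few HTML elements, deduplication catches only exact duplicates, and language and encoding detection are omitted entirely.

```python
import hashlib
import re
import unicodedata
from html.parser import HTMLParser

# Elements treated as boilerplate in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collect text that is not nested inside a boilerplate element."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def normalize(text):
    """Unicode NFC normalization plus whitespace collapsing."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def build_corpus(pages):
    """Turn raw HTML pages into a list of tokenized, deduplicated documents."""
    seen = set()
    corpus = []
    for html in pages:
        text = normalize(html_to_text(html))
        if not text:
            continue  # filter out pages with no usable text
        digest = hashlib.sha1(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate of an earlier document
        seen.add(digest)
        # Naive tokenization: runs of word characters or single punctuation marks.
        corpus.append(re.findall(r"\w+|[^\w\s]", text))
    return corpus
```

Real web corpora typically need near-duplicate detection and far more careful boilerplate removal than this; the sketch only shows how the stages fit together.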

Application

The corpus is the basic concept and the primary data source of corpus linguistics. The analysis and processing of various types of corpora are the subject of most work in computational linguistics (for example, keyword extraction), speech recognition and machine translation, where corpora are often used to build hidden Markov models for part-of-speech tagging and other tasks. Corpora and frequency dictionaries can also be useful in foreign language teaching.
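
As an illustration of how a tagged corpus can feed a hidden Markov model for part-of-speech tagging, the sketch below estimates transition and emission counts from a tiny invented corpus and decodes new sentences with the Viterbi algorithm. The training data, the tag names and the choice of add-one smoothing are all assumptions made for the example; a practical tagger would be trained on a large annotated corpus.

```python
import math
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)   # previous tag -> counts of next tag
    emit = defaultdict(Counter)    # tag -> counts of emitted word
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"               # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word.lower()] += 1
            vocab.add(word.lower())
            prev = tag
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Log probability with add-one smoothing."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for the given words."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1       # one extra outcome for unseen words
    # best[i][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = (log_prob(trans["<s>"], tag, n_tags)
                 + log_prob(emit[tag], words[0].lower(), n_words))
        best[0][tag] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            e = log_prob(emit[tag], words[i].lower(), n_words)
            score, prev = max(
                (best[i - 1][p][0] + log_prob(trans[p], tag, n_tags) + e, p)
                for p in tags
            )
            best[i][tag] = (score, prev)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Tiny invented training corpus of (word, tag) pairs.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit, vocab = train(corpus)
print(viterbi(["the", "cat", "barks"], trans, emit, vocab))
```

The same word counts collected in `emit` are essentially a frequency dictionary split by part of speech, which is one reason annotated corpora serve both tasks.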

Russian-language corpora

  • Russian National Corpus
  • General Internet Corpus of Russian
  • Russian-language corpus of the Aranea project
  • Corpus of Biographical Texts

See also

  • Computational Linguistics
  • Keyword

Notes

  1. ↑ Check of the word «корпус» on gramota.ru
  2. ↑ Vanyushkin, Grashchenko, 2017.
  3. ↑ Vladimír Benko, ARANEA - FAMILY OF BILLION WEB CORPORA

Literature

  • Vanyushkin A. S., Grashchenko L. A. Evaluation of keyword extraction algorithms: tools and resources // New Information Technologies in Automated Systems. - 2017. - No. 20. - pp. 95-102.
  • Nikolaev I. S., Mitrenina O. V., Lando T. M. Applied and Computational Linguistics. - Moscow: URSS, 2016. - 320 p.
Source - https://ru.wikipedia.org/w/index.php?title=Text_Case&oldid=101460305

