The Czech National Corpus (Český národní korpus, ČNK; abbreviated CNC in English) is a publicly searchable database of Czech written texts in electronic form, supported by Charles University in Prague. The site is available in Czech and English.
| Czech National Corpus | |
|---|---|
| URL | ucnk.ff.cuni.cz |
| Commercial | No |
| Site type | Educational / scientific project |
| Languages | Czech / English |
| Server location | Prague, Czech Republic |
| Author | |
| Current status | Active and under development |
Creation History
The idea of the CNC was first put forward in 1991 and was supported by representatives of the Faculty of Philosophy of Charles University, the Faculty of Mathematics and Physics of Charles University, Masaryk University, Palacký University, and the Czech Language Institute of the Czech Academy of Sciences.
Two factors prepared the ground for the corpus. The first was the divergence of contemporary Czech usage from the generally accepted norms; a corpus was expected to help Czech lexicography account for these deviations. The second was the stabilization of the political situation: wider cooperation with the international scientific community helped bring computational lexicography and corpus linguistics into Czech linguistics as separate disciplines. In 1994, the Institute of the Czech National Corpus was established at the Faculty of Philosophy of Charles University, and cooperation agreements were signed between the Institute and several Czech institutions [1].
Compilers
As of September 10, 2017, the following people work on the Czech National Corpus:
- Director Michal Křen
- Deputy Director Václav Cvrček
- Secretary Lucie Nováková
- Professor František Čermák
- Professor Karel Kučera, Head of the Diachronic Corpus Section
- Head of the Linguistic Section Václav Cvrček
- Head of the Computing Section Pavel Vondřička
- Head of the Spoken Corpus Section Marie Kopřivová
- Head of Linguistic Analysis and Annotation Section Tomáš Jelínek
- Head of the Parallel Corpus Section Alexander Rosen
- and others [2].
Composition and size of the corpus
| Corpus type | Size |
|---|---|
| Written corpora (synchronic) | ~2,705 million word tokens |
| Spoken corpora (synchronic) | ~4 million word tokens |
| Diachronic corpora | 1.95 million word tokens |
| Foreign-language corpora | 6,248 million word tokens |
| Parallel corpus | 92 million word tokens |
The total size of the corpus is over 9 billion word tokens, of which ~8,894.5 million are lemmatized and annotated with morphological tags [3].
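The stated total can be cross-checked against the table. The short Python sketch below simply adds up the sizes listed above; the dictionary keys are illustrative labels, not official corpus names.

```python
# Sizes in millions of word tokens, copied from the table above.
# The keys are illustrative labels chosen for this sketch.
sizes_millions = {
    "written (synchronic)": 2705,
    "spoken (synchronic)": 4,
    "diachronic": 1.95,
    "foreign-language": 6248,
    "parallel": 92,
}

total_millions = sum(sizes_millions.values())
total_billions = total_millions / 1000

print(f"total: {total_millions:.2f} million ({total_billions:.2f} billion)")
```

The sum comes to roughly 9,051 million, which agrees with the figure of "over 9 billion".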
Sources of texts
The main sources of CNC texts are:
- Texts received electronically from publishing houses and individual owners
- Texts obtained from newspapers (these make up the majority of the corpus texts, about 60%)
- Dictionary texts (for example, the FSC2000 corpus is tied to the Frequency Dictionary of Czech) [1]
A separate CNC corpus is devoted to George Orwell's dystopian novel "1984". Its relatively small size (80,000 words and 20,000 punctuation marks) made it possible to annotate the text manually with almost no errors [4].
Access
There are two types of access on the site: public and full.
An unregistered user can search only the SYN2010 corpus, which contains just 100 million words, about one ninetieth of the entire Czech National Corpus. SYN2010 consists [5] of 40% fiction, 27% technical literature, and 33% journalism. Most of the texts in this corpus were written between 2005 and 2009.
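The proportion and genre shares quoted above can be checked with a few lines of Python. This is only a rough sketch: the figure of 9,000 million is a rounded stand-in for "over 9 billion", so the ratio is approximate.

```python
# Figures quoted in the text, in millions of words.
# TOTAL_MILLIONS is a rounded assumption standing in for "over 9 billion".
SYN2010_MILLIONS = 100
TOTAL_MILLIONS = 9000

# Genre composition of SYN2010 in percent, as stated in the text.
genre_shares = {"fiction": 40, "technical literature": 27, "journalism": 33}

ratio = TOTAL_MILLIONS / SYN2010_MILLIONS
genre_millions = {
    genre: SYN2010_MILLIONS * pct / 100 for genre, pct in genre_shares.items()
}

print(f"SYN2010 is about 1/{ratio:.0f} of the whole collection")
for genre, millions in genre_millions.items():
    print(f"{genre}: {millions:.0f} million words")
```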
Public access shows the number of occurrences in SYN2010 and the first 50 examples. Results are displayed as concordance lines, each of which is a stretch of text containing the queried expression. Public access also supports basic regular expressions and keyword search.
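To illustrate what a concordance line is, here is a minimal keyword-in-context sketch in Python. It is not the CNC's own search engine or query language: the function `concordance`, its parameters, and the sample sentences are all hypothetical, and the pattern uses Python's `re` syntax.

```python
import re

def concordance(text, pattern, width=30, limit=50):
    """Return (total_hits, lines), where lines holds at most `limit`
    keyword-in-context lines with `width` characters of context per side."""
    lines = []
    total = 0
    for match in re.finditer(pattern, text):
        total += 1
        if len(lines) < limit:
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so the hits line up in a column.
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return total, lines

# Toy text, invented for this example.
sample = "Praha je hlavní město. V Praze žije přes milion lidí. Do Prahy jezdí turisté."

# A basic regular expression matching several inflected forms of "Praha".
total, lines = concordance(sample, r"\bPra\w+", width=15)
print(f"{total} hits")
for line in lines:
    print(line)
```

The `limit` argument mirrors the idea of reporting the full hit count while displaying only the first examples.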
A registered user has full access to the database of the Institute of the CNC, as well as to the dedicated corpus manager Bonito.
Bonito
Bonito ("A Modular Corpus Manager") is a graphical user interface (GUI) for the Manatee corpus manager, created at the Natural Language Processing Centre of the Faculty of Informatics, Masaryk University in Brno. Its author is Pavel Rychlý, an assistant at the faculty [6].
Collaboration
Currently, the following Czech institutions cooperate with the corpus:
- Institute of Formal and Applied Linguistics and the Faculty of Mathematics and Physics of Charles University, Prague
- Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Prague
- Faculty of Informatics, Masaryk University, Brno
- Faculty of Education Masaryk University, Brno
- Department of Czech and Slavic Linguistics, Faculty of Philology, Masaryk University, Brno
- Municipal Library of Prague
- Silesian University, Opava
- University of Hradec Králové
- Palacký University, Olomouc
- Czech Language Institute of the Czech Academy of Sciences
- and others [7].
The corpus also cooperates with the Faculty of Philosophy and Literature of the University of Granada (Spain), the Institute for the German Language in Mannheim (Germany), the University of Mannheim (Germany), the University of Amsterdam (Netherlands), and other major research centers [7].
See also
- Russian National Corpus
Notes
- [1] Czech National Corpus (CNC)
- [2] People | Institute of the Czech National Corpus
- [3] Available Corpora | Institute of the Czech National Corpus
- [4] ORWELL | Institute of the Czech National Corpus
- [5] Public Access. Archived October 29, 2013 at the Wayback Machine. Retrieved September 10, 2017.
- [6] Manatee / Bonito - A Modular Corpus Manager
- [7] Cooperation | Institute of the Czech National Corpus