Czech National Corpus

Czech National Corpus
URL	ucnk.ff.cuni.cz
Commercial	no
Site type	educational / scientific project
Languages	Czech / English
Server location	Prague, Czech Republic
Author
Current status	active and under development

The Czech National Corpus (Český národní korpus, ČNK; abbreviated CNC in English) is a publicly searchable database of written texts in electronic form in Czech, supported by Charles University in Prague. The site is available in Czech and English.

History

The idea of the CNC was first put forward in 1991 and was supported by representatives of the Faculty of Arts of Charles University, the Faculty of Mathematics and Physics of Charles University, Masaryk University, Palacký University, and the Czech Language Institute of the Czech Academy of Sciences.

Two factors prompted the creation of the corpus. The first was the divergence of modern Czech from generally accepted norms; a corpus would help rid Czech lexicography of such deviations. The second was the stabilization of the political situation: wider cooperation with the international scientific community helped bring computational lexicography and corpus linguistics into Czech linguistics as separate disciplines. In 1994, the Institute of the Czech National Corpus was established at the Faculty of Arts of Charles University, and cooperation agreements between the Institute and several Czech institutions were signed [1].

Staff

As of September 10, 2017, the following people work on the Czech National Corpus:

Director: Michal Křen
Deputy Director: Václav Cvrček
Secretary: Lucie Nováková
Professor: František Čermák
Professor and Head of the Diachronic Corpus Section: Karel Kučera
Head of the Linguistic Section: Václav Cvrček
Head of the Computing Section: Pavel Vondřička
Head of the Spoken Corpus Section: Marie Kopřivová
Head of the Linguistic Analysis and Annotation Section: Tomáš Jelínek
Head of the Parallel Corpus Section: Alexandr Rosen
and others [2].

Composition and size of the corpus

Written corpora (synchronic)	~2,705 million words
Spoken corpora (synchronic)	~4 million words
Diachronic corpora	1.95 million words
Foreign-language corpora	6,248 million words
Parallel corpus	92 million words

The total size of the corpus is over 9 billion words (running word occurrences), of which ~8,894.5 million are lemmatized and annotated with morphological tags [3].
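As a quick check of these figures, the component sizes can be summed and compared with the annotated portion. The Python sketch below only restates the numbers given above; the variable names and dictionary keys are illustrative and do not come from the CNC itself.

```python
# Component sizes in millions of words, copied from the table above.
sizes_millions = {
    "written (synchronic)": 2705,
    "spoken (synchronic)": 4,
    "diachronic": 1.95,
    "foreign-language": 6248,
    "parallel": 92,
}

# Portion that is lemmatized and morphologically tagged, in millions.
annotated_millions = 8894.5

total_millions = sum(sizes_millions.values())
annotated_share = annotated_millions / total_millions

print(f"total: {total_millions:.2f} million")       # total: 9050.95 million
print(f"annotated share: {annotated_share:.1%}")    # annotated share: 98.3%
```

The sum of about 9,051 million agrees with the stated total of over 9 billion, and roughly 98% of it carries lemmas and morphological tags.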

Sources of texts

The main sources of texts for the CNC are:

Texts received in electronic form from publishing houses and individual rights holders
Texts obtained from newspapers (these make up the majority of the corpus texts, about 60%)
Dictionary texts (for example, the FSC2000 corpus is associated with the Czech Frequency Dictionary) [1]

A separate corpus within the CNC is devoted to George Orwell's dystopian novel "1984". Its relatively small size (80,000 words and 20,000 punctuation marks) made it possible to annotate the text manually with almost no errors [4].

Access

There are two types of access on the site: public and full.

An unregistered user can search only the SYN2010 corpus, which contains just 100 million words, about one ninetieth of the whole Czech National Corpus. SYN2010 consists of 40% fiction, 27% technical literature, and 33% journalism [5]. Most of the texts in this corpus were written between 2005 and 2009.

Public access shows the number of hits in SYN2010 and the first 50 examples. Results are presented as concordance lines, each of which is a stretch of text containing the queried expression. Public access also supports basic regular expressions and keyword search.
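To illustrate what a concordance line is, here is a minimal keyword-in-context sketch in Python. It is not the CNC search interface or its query language: the function name `concordance`, the sample sentence, and the context width are assumptions made for illustration, and the pattern uses the syntax of Python's `re` module. The default limit of 50 simply mirrors the number of examples shown under public access.

```python
import re

def concordance(text, pattern, width=20, limit=50):
    """Return up to `limit` keyword-in-context lines for a regex pattern."""
    lines = []
    for match in re.finditer(pattern, text):
        # Take up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the hits line up in one column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
        if len(lines) >= limit:
            break
    return lines

# Hypothetical sample text; a real query would run against corpus data.
sample = "Korpus je soubor textů. Každý korpus má svá metadata."
hits = concordance(sample, r"[Kk]orpus", width=15)
for line in hits:
    print(line)
```

Each returned line keeps the matched expression in a fixed column with its left and right context, which is the layout usually meant by concordance lines.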

A registered user has full access to the database of the CNC Institute, as well as to the dedicated corpus manager Bonito.

Bonito

Bonito (A Modular Corpus Manager Bonito) is a graphical user interface (GUI) for the Manatee corpus manager, created at the Natural Language Processing Centre at the Faculty of Informatics of Masaryk University in Brno. Its author is Pavel Rychlý, an assistant at the faculty [6].

Collaboration

At present, the following Czech institutions cooperate with the CNC:

Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University, Prague
Faculty of Informatics, Masaryk University, Brno
Faculty of Education, Masaryk University, Brno
Department of Czech and Slavic Linguistics, Faculty of Arts, Masaryk University, Brno
Municipal Library of Prague
Silesian University, Opava
University of Hradec Králové
Palacký University, Olomouc
Czech Language Institute of the Czech Academy of Sciences
and others [7].

The CNC also cooperates with the Faculty of Philosophy and Literature of the University of Granada (Spain), the Institute for the German Language in Mannheim (Germany), the University of Mannheim (Germany), the University of Amsterdam (Netherlands), and other major research centers [7].

Notes

[1] Czech National Corpus (CNC)
[2] People | Institute of the Czech National Corpus
[3] Available Corpora | Institute of the Czech National Corpus
[4] ORWELL | Institute of the Czech National Corpus
[5] Public Access. Archived October 29, 2013 at the Wayback Machine. Retrieved September 10, 2017.
[6] Manatee / Bonito - A Modular Corpus Manager
[7] Cooperation | Institute of the Czech National Corpus

Links

Official website of the Czech National Corpus