Clever Geek Handbook

Wiktionary

Wiktionary (English Wiktionary) is a freely editable, multi-functional, multilingual dictionary and thesaurus based on a wiki engine. It is one of the projects of the Wikimedia Foundation. It first appeared, in English, on December 12, 2002.

Wiktionary
English Wiktionary
URL: wiktionary.org
Commercial: No
Site type: Online dictionary
Registration: Optional
Language(s): 170
Server location: Miami
Owner: Wikimedia Foundation
Author: Jimmy Wales
Alexa rank: ▼ 549 (September 9, 2017) [1]

The dictionary contains grammatical descriptions, definitions and translations of words. Articles may also give information about the etymology, phonetic properties and semantic relations of words. Wiktionary is thus an attempt to combine grammatical, explanatory, etymological and multilingual dictionaries, as well as a thesaurus, in one product.

Wiktionary data are actively used in various tasks involving computer processing of text and speech.

Lexicographic concept

Because the language sections of Wiktionary are linked to one another, and the dictionary is linked to other Wikimedia Foundation projects, participants in each section can use the concepts, tools and lexicographic materials created by their counterparts working in other languages. In the course of work on the various language sections, a comprehensive concept of a universal lexicographic resource has developed, one that electronic technologies have made possible for the first time. Ultimately the concept implies a complete, comprehensive description of all lexical units of all natural languages (and the main constructed languages) that have a written form. A complete description means that information is available on the phonetics, morphology, syntactic and semantic properties of a lexical unit, as well as its etymology, collocations and phraseology. The completeness and consistency with which this concept is implemented may vary between language sections of the project.

In each language section, the "title" language is central: all articles are written exclusively in it, and the goal is to translate the words and other units of this language into as many other languages as possible. Words of other languages are, as a rule, translated only into the "title" language. Thus the Russian Wiktionary gives definitions and translations into foreign languages for Russian words, while foreign words receive translations into Russian instead of definitions.

In describing morphology, an attempt is made to give the fullest possible picture of inflection, including the inflection class. In particular, morphological information on Russian lexemes follows the classification proposed by A. A. Zaliznyak.

An extensive list of references has been compiled to support Wiktionary; the English Wiktionary has developed rules for including a term in the dictionary (see Criteria for inclusion). Unlike the Russian Wikipedia, where authoritative sources take priority in the selection of material [Note 1], the Russian Wiktionary relies primarily on the analysis of word usage [Note 2] carried out by the editor of the article.

Thesaurus

Wiktionary records the following semantic relations: synonyms, antonyms, hypernyms, hyponyms, co-hyponyms, holonyms, meronyms and paronyms.

Wikipedia and Wiktionary

Wiktionary does not include detailed descriptions of facts or encyclopedic information. However, it provides information not found on Wikipedia: phrases, sayings, abbreviations, acronyms, spelling errors, simplified or distorted spellings and pronunciations, disputed usage, protologisms, onomatopoeia, various styles (e.g., colloquial) and subject areas [2]. Thus, Wikipedia and Wiktionary complement each other.

Wiktionary is similar to Wikipedia in that (1) articles about words are connected by internal links within Wiktionary, (2) there are categories, and (3) interwiki links point to articles about the same word in the Wiktionaries of other languages [2].

Linking projects

Wikipedia editors are encouraged to add the "wiktionary" template to articles (for example, {{wiktionary | wiktionary}}) to link to the corresponding Wiktionary entry. To link back from a Wiktionary page, use the "Wikipedia" template (for example, {{Wikipedia | Wikipedia}}).

Such templates soften the "encyclopedia or dictionary" dilemma and make information easier to reach: they give the encyclopedia article a link to additional linguistic information about the term and, conversely, give a link to the in-depth description of the word in the dictionary, which improves the overall connectedness of articles across Wikimedia Foundation projects.

To link to the definition of a word directly in the text of an article (the "wiktionary" template adds a whole block), interproject links are used. They are written as [[wikt:ru:слово|слово]] or, in short form, [[:wikt:слово|]], and are displayed like this: слово.
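As a minimal sketch of the two inline link forms described above, the following Python helper builds the corresponding wikitext strings. The function name wikt_link and its parameters are illustrative assumptions, not part of MediaWiki or of any Wiktionary tool; only the two output formats are taken from the text.

```python
from typing import Optional


def wikt_link(word: str, lang: Optional[str] = None) -> str:
    """Return wikitext for an inline interproject link to Wiktionary.

    Hypothetical helper: it only formats the two patterns quoted in
    the text and does not validate the word or the language code.
    """
    if lang is None:
        # Short form: leading colon and an empty label after the pipe.
        return f"[[:wikt:{word}|]]"
    # Full form: explicit language section, with the word as the label.
    return f"[[wikt:{lang}:{word}|{word}]]"


full_form = wikt_link("слово", lang="ru")
short_form = wikt_link("слово")
print(full_form)
print(short_form)
```

The helper returns plain strings, so it can be used when generating article text offline; how the wiki software renders the short form is outside the scope of this sketch.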

Russian section

Dynamics of the development of the Russian Wiktionary

The Russian section of Wiktionary was created in the spring of 2004. For a year and a half it barely developed, growing haphazardly and mainly with poor-quality material. The situation began to change in late 2005 and early 2006.

In 2006 the first administrator was appointed, the volume of articles grew almost fourfold compared with the previous year, powerful tools for describing morphology were created, and a well-developed system of semantic categories began to take shape.

By the autumn of 2006 the number of articles in the Russian Wiktionary had reached 10,000. Then a bot was created that used the word lists of other Wiktionary sections to generate stub articles in the Russian section, and about 70,000 articles were added in a month and a half. On November 7, 2006, the Russian Wiktionary passed 80,000 articles, and on December 10, 2006, it reached 100,000. On December 17, 2018, the number of articles exceeded 1,000,000. The number of active participants was about 230.

Unlike traditional dictionaries, the coverage of Wiktionary cannot be adequately judged by the formal count of articles. The automatic counter does not distinguish half-empty stubs from truly informative articles, and it does not take intra-language and inter-language homonymy into account. For example, the entry boron counts as one article, yet it describes several homonymous lexemes of Russian as well as similar lexemes of other languages (Bulgarian, Tatar); in a traditional dictionary this material would be presented, and counted, as several articles.
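The difference between counting pages and counting lexemes can be illustrated with a small Python sketch. The data below are invented for illustration (the number of homonyms per language is an assumption, not a figure from Wiktionary), and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: page title -> list of (language code, homonym number).
# One page can hold several homonymous lexemes in several languages;
# an empty list stands for a half-empty stub page.
pages = {
    "boron": [("ru", 1), ("ru", 2), ("ru", 3), ("bg", 1), ("tt", 1)],
    "stub-page": [],
}


def count_pages(pages):
    """What an automatic article counter reports: one unit per page."""
    return len(pages)


def count_lexemes(pages):
    """What a traditional dictionary would count: one unit per lexeme."""
    return sum(len(lexemes) for lexemes in pages.values())


def lexemes_per_language(pages):
    """Break the lexeme count down by language."""
    return Counter(lang for lexemes in pages.values() for lang, _ in lexemes)


print(count_pages(pages))
print(count_lexemes(pages))
print(lexemes_per_language(pages))
```

In this toy example the page counter sees two articles, while only one of them carries content and that one page alone would correspond to five entries in a traditional dictionary.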

Comparison with other Wiktionaries

 
The number of Russian words in the Russian Wiktionary (left) and in the English Wiktionary (right) [3] , data for 2011

Since August 2008 the Russian Wiktionary has ranked first among all Wiktionaries by database size [4], although it does not have the largest number of articles [5]. This is partly because projects with more articles than the Russian Wiktionary may have smaller articles on average, as the statistics website shows [6]. In addition, compared with other sections, the Russian Wiktionary contains more supporting information, including reference tables, frequency word lists and so on (unlike the dictionary entries that make up the so-called main namespace, this information is placed in sections such as "Indices").

A significant share of articles in the Russian Wiktionary are still stubs generated by bots. Although the large number of stub articles is sometimes criticized, such preliminary markup has several advantages. First, it speeds up the creation of articles, because some information, such as the part of speech of the word, is included from the start. Second, the structure of articles is standardized. Because templates are widely used (bots usually insert them when creating articles automatically), the appearance of many articles can be changed centrally at once. The large number of templates also makes further automated editing of existing articles easier, for example automatically inserting translations into articles prepared in advance, since bots navigate an article marked up with specialized constructs more easily than one written in free-form human language.

A distinctive feature of the Russian Wiktionary is its elaborated development concept (available from the main page). Owing to this concept and the widespread use of templates, articles in the Russian Wiktionary look more uniform than in many other projects: the number of sections, their order and the design of each section are basically the same.

The authors of [3] counted the dictionary entries for Russian words, with and without definitions, in the two Wiktionaries (see the illustration). The figures confirm the policy of the English Wiktionary editors not to create stub articles: only 5.57% of its entries for Russian words lack a definition, compared with 60.39% in the Russian Wiktionary. Nevertheless, as of 2011 the Russian Wiktionary had almost 3.4 times as many entries with definitions for Russian words as the English Wiktionary: 53.6 thousand versus 15.7 thousand.
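The ratio quoted above can be checked with a few lines of Python. The totals computed at the end are only a back-of-the-envelope estimate derived from the rounded figures in the text, under the assumption that every entry either has or lacks a definition; they are not numbers reported in [3]. All variable and function names are illustrative.

```python
# Figures quoted in the text (2011 data, entries for Russian words).
with_defs_ru = 53.6      # thousand entries with definitions, Russian Wiktionary
with_defs_en = 15.7      # thousand entries with definitions, English Wiktionary
share_no_def_ru = 0.6039  # share of entries without a definition, Russian Wiktionary
share_no_def_en = 0.0557  # share of entries without a definition, English Wiktionary

# "Almost 3.4 times more" entries with definitions.
ratio = with_defs_ru / with_defs_en


def estimated_total(with_defs, share_without):
    """Estimate the total number of entries (in thousands).

    Assumes total = with_defs + without_defs, so that
    with_defs = total * (1 - share_without).
    """
    return with_defs / (1.0 - share_without)


total_ru = estimated_total(with_defs_ru, share_no_def_ru)
total_en = estimated_total(with_defs_en, share_no_def_en)

print(f"ratio of entries with definitions: {ratio:.2f}")
print(f"estimated totals (thousands): ru={total_ru:.1f}, en={total_en:.1f}")
```

Because the inputs are rounded, the estimated totals should be read as orders of magnitude rather than exact counts.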

Using wiktionaries in automatic text and speech processing tasks

To use the lexicographic data of Wiktionaries in automatic text and speech processing, the texts of dictionary entries (semi-structured data [7]) must be converted into a machine-readable format [8] [9] [10].

Extracting data from Wiktionaries is not easy. The following difficulties can be identified [11]: (1) both the data and the structure of articles change regularly and frequently; (2) different Wiktionaries use different article structures and formats [Note 3]; (3) wiki technology is designed primarily for use by humans, not for machine processing.

Several parsers exist for different Wiktionaries [12]:

  • DBpedia Wiktionary is an extension of the DBpedia project; data are extracted from the English, French, German and Russian Wiktionaries. Extracted: language, part of speech, definitions, semantic relations, translations. Extraction relies on a declarative description of the dictionary entry structure [13], regular expressions [14] and finite-state transducers (FST), a kind of finite automaton [15].
  • JWKTL (Java Wiktionary Library) is an API to the data of the English and German Wiktionaries [16]. Extracted: language, part of speech, definitions, quotations, semantic relations, etymology and translations. The program is available for non-commercial use.
  • wikokit is a parser for the English and Russian Wiktionaries [17]. Extracted: language, part of speech, definitions, quotations [18] (Russian Wiktionary only), semantic relations [19] and translations. The source code is available under an open multi-license.
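To give a feel for what turning a semi-structured entry into machine-readable data involves, here is a minimal Python sketch based on regular expressions. It is not the method of DBpedia Wiktionary, JWKTL or wikokit. It assumes a heavily simplified entry layout (level-2 headings for languages, level-3 headings for parts of speech, lines starting with "#" for definitions), and the sample entry, the part-of-speech list and all names are made up for illustration.

```python
import re

# A heading such as "==English==" or "===Noun===": the same run of "=" on both sides.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")
# A definition line "# text"; "#:", "#*" and "##" lines (examples, quotes) are skipped.
DEFINITION = re.compile(r"^#(?![:*#])\s*(.+)$")
# Assumption: only these level-3 headings are treated as parts of speech.
PARTS_OF_SPEECH = {"Noun", "Verb", "Adjective", "Adverb"}


def parse_entry(wikitext):
    """Return {language: {part_of_speech: [definitions]}} for one entry."""
    result = {}
    language = pos = None
    for line in wikitext.splitlines():
        heading = HEADING.match(line)
        if heading:
            level, title = len(heading.group(1)), heading.group(2)
            if level == 2:
                language, pos = title, None
                result[language] = {}
            elif level == 3 and language is not None and title in PARTS_OF_SPEECH:
                pos = title
                result[language][pos] = []
            else:
                pos = None  # other sections (synonyms, etymology, ...) are ignored
            continue
        definition = DEFINITION.match(line)
        if definition and pos is not None:
            result[language][pos].append(definition.group(1).strip())
    return result


# Hypothetical sample entry in the simplified layout assumed above.
SAMPLE = """\
==English==
===Noun===
# A reference work listing words.
#: ''example usage line''
====Synonyms====
* [[wordbook]]
===Verb===
# To look up in a dictionary.
==French==
===Noun===
# Hypothetical second-language sense.
"""

parsed = parse_entry(SAMPLE)
print(parsed)
```

Even this toy parser shows the difficulties listed above: any change in heading names or nesting silently changes the output, and each language edition would need its own heading conventions.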

Wiktionary data have been used in a number of text and speech processing tasks [20]:

  • rule-based machine translation between Dutch and Afrikaans; data from the English and Dutch Wiktionaries and two Wikipedias are used within the Apertium system [21];
  • creation of NULEX, a machine-readable dictionary that integrates open linguistic resources: the English Wiktionary, WordNet and VerbNet [22]. For nouns, the part of speech and the plural form were extracted from the English Wiktionary; for verbs, the tense forms. Screen scraping was used to extract the data from Wiktionary;
  • speech recognition and synthesis, where Wiktionary serves as a data source for automatically building a pronunciation dictionary [23]. Word-pronunciation pairs (transcribed in the International Phonetic Alphabet, IPA) are extracted from the Czech, English, French, German, Polish and Spanish Wiktionaries [Note 4]. On verification, the largest number of errors was found in transcriptions extracted from the English Wiktionary [24];
  • building ontologies [25] and knowledge bases [26] ;
  • ontology mapping [27] ;
  • text simplification. In [28], word difficulty is estimated from Wiktionary data. For a word in the English Wiktionary, the following are extracted: the size of the dictionary entry, the number of parts of speech, the number of senses and the number of translations. The authors of [28] assumed that simpler, more basic words are those with more senses (and hence a larger entry), more parts of speech and more translations. "Difficult" words found in a text are then to be paraphrased with "simpler" equivalents, which simplifies (adapts) the text;
  • part-of-speech tagging. Li et al. (2012) [29] used data from the English Wiktionary to build POS taggers for eight languages with "poor linguistic resources", using hidden Markov models [Note 5];
  • sentiment analysis of text [30].
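The word-difficulty heuristic from [28] mentioned in the list above can be sketched in Python as follows. Only the four features and the direction of the assumption (more senses, parts of speech and translations suggest a simpler word) come from the text. The equal weighting, the normalization by the maximum, the class and function names and the feature values are all assumptions made for illustration, not the procedure of [28].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryFeatures:
    word: str
    entry_size: int       # size of the dictionary entry, e.g. in bytes
    n_pos: int            # number of parts of speech
    n_senses: int         # number of senses
    n_translations: int   # number of translations


FEATURES = ("entry_size", "n_pos", "n_senses", "n_translations")


def simplicity_scores(entries):
    """Score each word in [0, 1]; a higher score suggests a simpler word.

    Each feature is divided by its maximum over the given entries and
    the four normalized values are averaged with equal weights
    (an assumption of this sketch).
    """
    if not entries:
        return {}
    maxima = {name: max(getattr(e, name) for e in entries) or 1 for name in FEATURES}
    return {
        e.word: sum(getattr(e, name) / maxima[name] for name in FEATURES) / len(FEATURES)
        for e in entries
    }


# Invented feature values, for illustration only.
entries = [
    EntryFeatures("run", entry_size=9000, n_pos=2, n_senses=40, n_translations=300),
    EntryFeatures("perambulate", entry_size=600, n_pos=1, n_senses=2, n_translations=5),
]

scores = simplicity_scores(entries)
ranked = sorted(scores, key=scores.get, reverse=True)  # simplest word first
print(ranked)
```

A simplification system built on such scores would flag low-scoring words in a text as candidates for replacement by higher-scoring equivalents.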

See also

  • Russian Wiktionary
  • Tatoeba

Notes

Comments
  1. ↑ Wikipedia: Authoritative Sources

    Wikipedia articles should be based on published authoritative sources.

  2. ↑ Wiktionary: Lexicographic Concept

    If there is disagreement about any described property of a language unit, priority (in terms of evidence) is given to corpus sources.

  3. ↑ Compare, for example, the article structure and formatting rules of the English Wiktionary and the Russian Wiktionary.
  4. ↑ If a dictionary entry has several transcriptions, the first one is taken.
  5. ↑ The source code of the program and the results of part of the markup are available online: https://code.google.com/p/wikily-supervised-pos-tagger
Sources
  1. ↑ Wiktionary site global ranking (in English). Alexa Internet. Retrieved September 9, 2017.
  2. ↑ 1 2 Zesch et al., 2008, p. 2.
  3. ↑ 1 2 Smirnov et al., 2012.
  4. ↑ Wiktionary statistics: Database size
  5. ↑ Wiktionary statistics
  6. ↑ Wiktionary statistics: Bytes per article
  7. ↑ Meyer and Gurevych, 2012, p. 140.
  8. ↑ Zesch et al., 2008, Figure 1, p. 4.
  9. ↑ Meyer and Gurevych, 2010, p. 40.
  10. ↑ Krizhanovsky, Transformation, 2010, p. 1.
  11. ↑ Hellmann and Auer, 2013, p. 16 in PDF, p. 302.
  12. ↑ Hellmann et al., 2012, Table 1, p. 3.
  13. ↑ Hellmann et al., 2012, pp. 8-9.
  14. ↑ Hellmann et al., 2012, p. 10.
  15. ↑ Hellmann et al., 2012, p. 11.
  16. ↑ Zesch et al., 2008.
  17. ↑ Krizhanovsky, Transformation, 2010.
  18. ↑ Krizhanovsky, 2011.
  19. ↑ Krizhanovsky, Comparison, 2010.
  20. ↑ Smirnov et al., 2012, pp. 233–234.
  21. ↑ Otte and Tyers, 2011.
  22. ↑ McFate and Forbus, 2011.
  23. ↑ Schlippe et al., 2012.
  24. ↑ Schlippe et al., 2012, p. 4804.
  25. ↑ Meyer and Gurevych, 2012.
  26. ↑ ConceptNet 5. Retrieved April 17, 2013. Archived April 19, 2013.
  27. ↑ Lin and Krizhanovsky, 2011.
  28. ↑ 1 2 Medero and Ostendorf, 2009.
  29. ↑ Li et al., 2012.
  30. ↑ Chesley et al., 2006.

Literature

  • Krizhanovsky A. Transformation of the structure of the Wiktionary dictionary entry into tables and relational database relations : preprint. - 2010.
  • Krizhanovsky A. Comparison of the thesauruses of the Russian and English Wiktionaries, transformed into a machine-readable format : preprint. - 2010.
  • Krizhanovsky A. Assessment of the use of corpora and digital libraries in the Russian Wiktionary // Proceedings of the international conference "Corpus Linguistics-2011". - SPb.: St. Petersburg State University, Faculty of Philology, 2011. - pp. 217–222. - 348 p. - ISBN 978-5-8465-0005-5.
  • Smirnov A., Kruglov V., Krizhanovsky A. A., Lugovaya N. B., Karpov A., Kipyatkova I. Quantitative analysis of the vocabulary of the Russian WordNet and Wiktionaries // SPIIRAS Proceedings. - SPb., 2012. - Vol. 23. - pp. 231–253.
  • Chesley P., Vincent B., Xu L., Srihari R. K. Using verbs and adjectives to automatically classify blog sentiment // Training. - 2006. - Vol. 580. - pp. 233-235.
  • Hellmann S., Brekle J., Auer S. Leveraging the Crowdsourcing of Lexical Resources for Bootstrapping a Linguistic Data Cloud // Proc. Joint Int. Semantic Technology Conference (JIST), Dec 2-4. - Nara, Japan, 2012.
  • Hellmann S., Auer S. Towards Web-Scale Collaborative Knowledge Extraction // The People's Web Meets NLP / Gurevych, Iryna; Kim, Jungi. - Springer, 2013. - pp. 287-313. - 378 p. - (Theory and Applications of Natural Language Processing). - ISBN 978-3-642-35084-9.
  • Li S., Graça J. V., Taskar B. Wiki-ly supervised part-of-speech tagging // Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. - Jeju Island, Korea: Association for Computational Linguistics, 2012. - pp. 1389–1398. Archived May 22, 2013.
  • Lin F., Krizhanovsky A. Multilingual ontology matching based on Wiktionary data accessible via SPARQL endpoint // Proc. of the 13th Russian Conference on Digital Libraries RCDL'2011. October 19-22, Voronezh, Russia. - 2011. - pp. 19–26.
  • McFate C., Forbus K. NULEX: An Open-License Broad Coverage Lexicon // Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 19-24 June, 2011, Portland, Oregon, USA - Short Papers. - The Association for Computational Linguistics, 2011. - pp. 363-367. - ISBN 978-1-932432-88-6.
  • Medero J. and Ostendorf M. Analysis of vocabulary difficulty using wiktionary // Proc. SLaTE Workshop. - 2009.
  • Meyer C. M., Gurevych I. Worth its Weight in Gold or Yet Another Resource - A Comparative Study of Wiktionary, OpenThesaurus and GermaNet // Proc. 11th International Conference on Intelligent Text Processing and Computational Linguistics. - Iasi, Romania, 2010. - pp. 38-49. Archived December 1, 2017.
  • Meyer CM and Gurevych I. OntoWiktionary - Constructing an Ontology from the Collaborative Online Dictionary Wiktionary // Semi-Automatic Ontology Development: Processes and Resources / MT Pazienza and A. Stellato. - IGI Global, 2012. - p. 131-161. - ISBN 978-1-4666-0188-8 .
  • Otte P., Tyers F. M. Rapid rule-based machine translation between Dutch and Afrikaans // EAMT 2011: proc. 15th Conference of the European Association for Machine Translation / Mikel L. Forcada, Heidi Depraetere, Vincent Vandeghinste. - Leuven, Belgium, 2011. - pp. 153-160.
  • Schlippe T., Ochs S., Schultz T. Grapheme-to-phoneme model generation for Indo-European languages // Proceedings of the 37th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2012), Kyoto, Japan, 25-30 March. - 2012. - pp. 4801-4804.
  • Zesch T., Müller C., Gurevych I. Extracting Lexical Semantic Knowledge from Wikipedia and Wiktionary // Proc. of the 6th International Conference on Language Resources and Evaluation. - Marrakech, Morocco, 2008.

Links

  • Wiktionary front page
  • Meta: Main Page - OmegaWiki
Source - https://ru.wikipedia.org/w/index.php?title=Wiktionary&oldid=101163766

