Stemming is the process of finding the stem (base form) of a given source word. The stem does not necessarily coincide with the morphological root of the word.
The task of finding the stem of a word is a long-standing problem in computer science; the first publication on the subject dates from 1968. Stemming is used in search engines to expand users' search queries and is part of the text normalization process.
A specific method for finding the stems of words is called a stemming algorithm, and a specific implementation is called a stemmer.
History
The first published stemmer was written by Julie Beth Lovins in 1968 [1]. The paper is notable for its early publication date and had a great influence on later work in this area.
A later stemmer was written by Martin Porter and published in 1980. It has been used very widely and has become the de facto standard algorithm for English text. Dr. Porter received the Strix Award in 2000 for his work on stemming and information retrieval.
Many implementations of Porter's stemming algorithm have been written and freely distributed; however, many of these implementations contain hard-to-find flaws, so the algorithms did not work at full strength. To eliminate this kind of error, Martin Porter released an official free implementation of the algorithm around 2000. He continued this work over the next several years, developing Snowball, a framework for creating stemming algorithms, as well as an improved English stemmer and stemmers for several other languages.
Algorithms
There are several types of stemming algorithms, which differ in performance, accuracy, and the way particular stemming problems are overcome.
Lookup Algorithms
A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are its simplicity, speed, and easy handling of exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words will not be processed even if they are correct (for example, iPads ~ iPad), and the table can be very large. For languages with simple morphology, like English, table sizes are modest, but for highly inflected languages (for example, Turkish) a table may contain hundreds of possible inflected forms for each root.
Lookup tables used in stemmers are usually generated semi-automatically. For example, for the English word “run”, the forms “running”, “runs”, “runned”, and “runly” would be generated automatically. The last two forms are valid constructions, but they are unlikely to appear in ordinary English text.
The lookup approach may use prior part-of-speech tagging to avoid the kind of error in which different words are assigned to the same lemma (overstemming) [2].
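A minimal sketch of the lookup approach in Python (the table contents and names are illustrative, not taken from any particular stemmer):

```python
# A toy lookup-table stemmer: every inflected form must be listed explicitly.
# This tiny table is illustrative; a real table would hold thousands of entries.
LOOKUP_TABLE = {
    "running": "run",
    "runs": "run",
    "ran": "run",   # exceptions are as easy to handle as regular forms
    "cats": "cat",
}

def lookup_stem(word: str) -> str:
    # Unknown words (e.g. "iPads") fall through unchanged,
    # illustrating the approach's main weakness.
    return LOOKUP_TABLE.get(word.lower(), word)

assert lookup_stem("running") == "run"
assert lookup_stem("iPads") == "iPads"  # not in the table, so left as-is
```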
Suffix-Stripping Algorithms
Suffix-stripping algorithms do not rely on a lookup table of inflected forms and root–form relations. Instead, a typically smaller list of “rules” is stored and used, given the form of a word, to find its stem [3]. Some example rules are as follows (a minimal sketch in code follows the list):
- if the word ends with 'ed', delete 'ed'
- if the word ends with 'ing', delete 'ing'
- if the word ends with 'ly', delete 'ly'
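A minimal sketch of these three rules, assuming a simple first-match policy (illustrative only, not the Porter algorithm):

```python
# The three sample rules above, applied first-match-wins.
RULES = [("ed", ""), ("ing", ""), ("ly", "")]

def strip_suffix(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word  # no rule matched

print(strip_suffix("walked"))   # -> walk
print(strip_suffix("running"))  # -> runn (no letter-doubling handling here)
print(strip_suffix("quickly"))  # -> quick
```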
Suffix-stripping algorithms are much more efficient than exhaustive-search algorithms. Developing them requires a programmer who is well versed in linguistics, in particular morphology, and who can also encode the stripping rules. Suffix stripping is ineffective in exceptional cases (for example, 'ran' and 'run'). The solutions produced by suffix stripping are limited to those parts of speech that have well-known endings and suffixes, with some exceptions. This is a serious limitation, since not all parts of speech have such a well-formulated set of rules. Lemmatization attempts to remove this restriction.
Prefix-stripping algorithms can also be implemented; however, not all languages use prefixes and suffixes.
Additional Algorithm Criteria
Suffix-stripping algorithms may differ in their results for a variety of reasons. One such reason is whether the algorithm is required to output a real word of the given language. Some approaches do not require the output to be present in the language's lexicon. Other algorithms maintain a database of all known morphological roots that exist as real words and check whether a candidate term appears in that database when making decisions. Typically, if the term is not found, alternative actions are performed; these may apply slightly different decision criteria. For example, the nonexistence of an output term may cause the algorithm to try alternative stripping rules.
It may happen that two or more stripping rules apply to the same input term, creating ambiguity about which rule to use. The algorithm may assign priorities to such rules (by hand or stochastically), or it may reject a rule that leads to a nonexistent term when another rule does not. For example, given the English term “friendlies”, the algorithm may identify the suffix “ies”, apply the corresponding rule, and obtain the result “friendl”. Since “friendl” is most likely absent from the lexicon, the rule is rejected.
One improvement over plain suffix stripping is the use of suffix and ending substitution. Like a stripping rule, a substitution rule replaces a suffix or ending with an alternative one. For example, there may be a rule that replaces “ies” with “y”. Where a stripping rule would produce a term that does not exist in the lexicon, a substitution rule avoids the problem. In this example, “friendlies” is converted to “friendly” instead of “friendl”.
Typically, these rules are applied cyclically or recursively. After the first substitution rule is applied in this example, the algorithm selects the next rule for the term “friendly”, which triggers the rule that strips the suffix “ly”. Thus, by the substitution rule the term “friendlies” becomes “friendly”, which the stripping rule then reduces to “friend”.
This example helps demonstrate the difference between a rule-based method and brute force. With exhaustive search, the algorithm would look up the term “friendlies” in a set of hundreds of thousands of inflected word forms and, ideally, find the corresponding stem “friend”. In the rule-based method, the rules are executed in sequence and arrive at the same solution, most likely faster.
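A minimal sketch of cyclic rule application with a substitution rule, reproducing the “friendlies” example; the tiny lexicon and rule list are assumptions for illustration:

```python
# Illustrative lexicon and rules for the "friendlies" example.
LEXICON = {"friend", "friendly", "friendlies"}
# Each rule is (suffix, replacement); "ies" -> "y" is a substitution rule,
# the others are plain truncations.
RULES = [("ies", "y"), ("ly", ""), ("ing", "")]

def stem(word: str) -> str:
    changed = True
    while changed:  # apply rules cyclically until none can fire
        changed = False
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                candidate = word[: -len(suffix)] + replacement
                if candidate in LEXICON:  # reject rules that yield non-words
                    word = candidate
                    changed = True
                    break
    return word

print(stem("friendlies"))  # friendlies -> friendly -> friend
```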
Affix Stemmers
In linguistics, the most common terms for affixes are suffix and prefix. Besides the approaches that handle suffixes or endings, some also handle prefixes. For example, for the English word “indefinitely”, such a method would determine that the construction “in” at the beginning of the word is a prefix that can be removed to obtain the stem. Many of the methods mentioned above also allow this: a stripping algorithm that processes prefixes as well as suffixes and endings is called an affix-stripping algorithm. Research on affix stemmers for several European languages can be found in the publication (Jongejan et al 2009).
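A minimal sketch of combined prefix and suffix handling; the prefix and suffix lists, the length guard, and the output are purely illustrative:

```python
PREFIXES = ("in", "un")     # illustrative; "indefinitely" -> drop "in"
SUFFIXES = ("itely", "ly")  # checked longest-first

def affix_strip(word: str) -> str:
    for prefix in PREFIXES:
        # Only strip a prefix if a reasonably long remainder survives.
        if word.startswith(prefix) and len(word) > len(prefix) + 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            word = word[: -len(suffix)]
            break
    return word

print(affix_strip("indefinitely"))  # -> defin (illustrative result only)
```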
Lemmatization Algorithms
A more complex approach to finding the stem of a word is lemmatization. To understand how it works, one needs to know how different forms of a word are created. Most words change when they are used in different grammatical forms: the ending of the word is replaced by a grammatical ending, producing a new form of the original word. Lemmatization performs the inverse transformation: it replaces the grammatical ending with the ending of the initial (dictionary) form [4].
Lemmatization also includes determining the part of speech of a word and applying different normalization rules to each part of speech. Part-of-speech identification happens before the attempt to find the stem, since for some languages the stemming rules depend on the word's part of speech.
This approach depends critically on correctly identifying the lexical category (part of speech). Although the normalization rules of some lexical categories partially overlap, specifying the wrong category, or being unable to determine the correct one, negates this approach's advantage over suffix-stripping algorithms. The basic idea is that the more information the stemmer has about the word being processed, the more accurate the normalization rules it can apply.
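A minimal sketch of part-of-speech-dependent normalization, assuming the part of speech has already been determined by a tagger; the tags and rules are illustrative:

```python
# Normalization rules keyed by part of speech (illustrative).
POS_RULES = {
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
    "NOUN": [("ies", "y"), ("s", "")],
}

def lemmatize(word: str, pos: str) -> str:
    for suffix, replacement in POS_RULES.get(pos, []):
        if word.endswith(suffix):
            return word[: -len(suffix)] + replacement
    return word

# The same rules behave differently depending on the assigned POS:
print(lemmatize("building", "VERB"))   # -> build
print(lemmatize("buildings", "NOUN"))  # -> building
```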
Ripple-down rules approach
Ripple-down rules were originally developed for knowledge acquisition and the maintenance of rule-based systems. In this approach, knowledge is acquired incrementally, based on the current context. Rules are created to classify cases that match a specific context.
Unlike standard classification rules, ripple-down rules use exceptions to existing rules, so changes apply only within the context of a rule and do not affect other rules. Knowledge acquisition tools help find and modify conflicting rules. Here is a simple example of a ripple-down rule:
if a ^ b then c except if d then e else if f ^ g then h
This rule can be interpreted as follows: “if a and b are true, then we decide c, unless d is also true; if d is true (the exception), then we decide e. If a and b are not true, we move to the next rule and decide h if f and g are true.” This form of rules solves the lemmatization problem very well [5].
To create an exception to a rule, the algorithm must first find the word that induced the exception. The differences between the two words are then determined, and the rule's exception condition corresponds to those differences.
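A minimal sketch encoding the example rule above as a nested structure with a small interpreter (the encoding is an assumption for illustration, not a production ripple-down rules engine):

```python
# The example rule "if a ^ b then c except if d then e else if f ^ g then h"
# encoded as a nested structure: condition, conclusion, exception, else-rule.
def make_rule(cond, conclusion, exception=None, otherwise=None):
    return {"cond": cond, "then": conclusion,
            "except": exception, "else": otherwise}

RULE = make_rule(
    cond=lambda facts: facts["a"] and facts["b"],
    conclusion="c",
    exception=make_rule(lambda facts: facts["d"], "e"),
    otherwise=make_rule(lambda facts: facts["f"] and facts["g"], "h"),
)

def evaluate(rule, facts):
    if rule is None:
        return None
    if rule["cond"](facts):
        # An exception is consulted only when the main condition holds.
        exc = evaluate(rule["except"], facts)
        return exc if exc is not None else rule["then"]
    return evaluate(rule["else"], facts)

facts = {"a": True, "b": True, "d": False, "f": False, "g": False}
print(evaluate(RULE, facts))  # -> c
facts["d"] = True
print(evaluate(RULE, facts))  # -> e (the exception fires)
```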
Stochastic Algorithms
Stochastic algorithms determine the root form of a word probabilistically. Such an algorithm builds a probabilistic model, trained on a table of correspondences between root forms and inflected forms. The model is usually expressed as complex linguistic rules, similar in nature to those used in suffix stripping and lemmatization. Stemming is performed by feeding inflected forms to the trained model, which generates the root form according to its internal rule set; decisions about which rule or sequence of rules to apply, and which stem to choose, are made so that the resulting correct word has the highest probability (incorrect words have the lowest probability).
Some lemmatization algorithms are stochastic in the sense that a word may belong to several parts of speech, each with a certain probability. These algorithms may also take the surrounding words, called the context, into account (context-free grammars do not use any additional information). In either case, after a probability is assigned to each possible part of speech, the most probable part of speech is chosen, along with the corresponding rules for producing the normalized form.
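A minimal sketch of probability-driven rule selection, assuming rule probabilities have already been estimated from a training table; all numbers are illustrative:

```python
# Candidate rules with probabilities estimated from training data (illustrative).
# Each entry: (suffix, replacement, probability that applying the rule is correct).
WEIGHTED_RULES = [
    ("ies", "y", 0.85),
    ("es", "", 0.60),
    ("s", "", 0.55),
]

def stochastic_stem(word: str) -> str:
    candidates = [
        (prob, word[: -len(suffix)] + repl)
        for suffix, repl, prob in WEIGHTED_RULES
        if word.endswith(suffix)
    ]
    if not candidates:
        return word
    # Pick the stem produced by the most probable applicable rule.
    return max(candidates)[1]

print(stochastic_stem("ponies"))  # "ies"->"y" beats "es" and "s": pony
```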
Statistical Algorithms
N-gram analysis
Some stemming algorithms use N-gram analysis to select a suitable stem for a word [6].
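A minimal sketch of one possible use of N-grams: choosing, among candidate stems, the one whose bigram profile is most similar to the word (Dice coefficient; the candidate list is illustrative):

```python
def bigrams(s: str) -> set:
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice(a: str, b: str) -> float:
    # Dice coefficient over character bigrams: 2|A∩B| / (|A|+|B|).
    ga, gb = bigrams(a), bigrams(b)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def ngram_stem(word: str, candidates: list) -> str:
    # Pick the candidate stem most similar to the word by bigram overlap.
    return max(candidates, key=lambda c: dice(word, c))

print(ngram_stem("fishing", ["fish", "fin", "shed"]))  # -> fish
```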
Text Corpus Stemming
One of the main drawbacks of classical stemmers (for example, the Porter stemmer) is that they often do not distinguish words with similar syntax but completely different meanings. For example, “news” and “new” are both reduced to the stem “new”, although the words belong to different lexical categories. Another problem is that a stemming algorithm may suit one domain and cause too many errors in another: the words “stock”, “stocks”, “stocking”, and so on have a special meaning in the texts of The Wall Street Journal. The main idea of corpus-based stemming is to create equivalence classes of words using a classical stemmer and then “break up” some of the merged words based on their occurrence in the corpus. This also helps prevent well-known conflations of the Porter algorithm, such as “policy/police”, since the chance of these words occurring together is rather low [7].
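A minimal sketch of the corpus-based idea: start from the equivalence classes a classical stemmer produces and split off words that rarely co-occur; the classes, counts, and threshold are illustrative:

```python
from collections import defaultdict

# Equivalence classes produced by a classical stemmer (illustrative).
CLASSES = {"polic": ["policy", "police"], "new": ["new", "news"]}
# Document co-occurrence counts estimated from a corpus (illustrative).
COOCCURRENCE = {("police", "policy"): 2, ("new", "news"): 45}
THRESHOLD = 10  # pairs co-occurring less often than this are split apart

def refine(classes):
    refined = defaultdict(list)
    for stem, words in classes.items():
        kept = [words[0]]
        for w in words[1:]:
            pair = tuple(sorted((kept[0], w)))
            if COOCCURRENCE.get(pair, 0) >= THRESHOLD:
                kept.append(w)        # genuinely related: keep together
            else:
                refined[w].append(w)  # split into its own class
        refined[stem].extend(kept)
    return dict(refined)

print(refine(CLASSES))
# {'police': ['police'], 'polic': ['policy'], 'new': ['new', 'news']}
```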
Matching Algorithms
Such algorithms use a database of stems (for example, a set of documents containing word stems). These stems do not necessarily correspond to ordinary words; in most cases a stem is simply a substring (for example, in English “brows” is a substring of “browse” and “browsing”). To determine the stem of a word, the algorithm tries to match it against stems from the database, applying various constraints, for example on the length of the candidate stem relative to the length of the word (the short prefix “be”, which is the stem of words such as “be”, “been”, and “being”, is not the stem of the word “beside”).
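A minimal sketch of matching against a stem database with a relative-length constraint, using the “beside” example from the text; the database contents and ratio threshold are assumptions:

```python
STEM_DB = {"brows", "be", "run"}  # illustrative database of stems
MIN_RATIO = 0.5  # the stem must cover at least half of the word

def match_stem(word: str):
    # Prefer the longest database stem that begins the word,
    # subject to the relative-length constraint.
    best = None
    for stem in STEM_DB:
        if word.startswith(stem) and len(stem) / len(word) >= MIN_RATIO:
            if best is None or len(stem) > len(best):
                best = stem
    return best

print(match_stem("browsing"))  # -> brows
print(match_stem("beside"))    # -> None: "be" is too short relative to "beside"
```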
Hybrid Approaches
Hybrid approaches use two or more of the methods described above. A simple example is an algorithm that uses a suffix tree and first consults a lookup table by exhaustive search. However, instead of storing the entire set of relations between the words of a language, the lookup table holds only a small number of “frequent exceptions” (for example, for English, “ran => run”). If the word is not in the exception list, suffix stripping or lemmatization is applied to obtain the result.
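A minimal sketch of this hybrid scheme: a small exception table is consulted first, and suffix stripping is the fallback; the table and rules are illustrative:

```python
EXCEPTIONS = {"ran": "run", "geese": "goose"}  # small table of frequent exceptions
SUFFIX_RULES = [("ing", ""), ("ed", ""), ("s", "")]

def hybrid_stem(word: str) -> str:
    if word in EXCEPTIONS:             # 1) exception lookup first
        return EXCEPTIONS[word]
    for suffix, repl in SUFFIX_RULES:  # 2) fall back to suffix stripping
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

print(hybrid_stem("ran"))     # -> run (from the exception table)
print(hybrid_stem("walked"))  # -> walk (from the rules)
```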
Languages
Language Features
While most early scientific work in this area focused on English (mainly using the Porter stemming algorithm), later work has addressed many other languages [8][9][10][11][12].
Hebrew and Arabic are still considered difficult languages for stemming. English stemming algorithms are fairly trivial (problems arise only occasionally, for example with “dries”, the third-person singular present form of the verb “dry”, or “axes”, the plural of both “ax” and “axis”); but stemmers become increasingly difficult to design as the target language becomes more complex in morphology and spelling. For example, stemmers for Italian are more complicated than those for English (because of the large number of inflected verb forms), implementations for Russian are more difficult still (many noun declensions), Hebrew is even harder (due to non-concatenative morphology, a writing system without vowels, and the need for prefix stripping: Hebrew stems can be two, three, or four characters long, but no longer), and so on.
Multilingual stemming algorithms apply the morphological rules of two or more languages at the same time.
Russian Language Stemming
Russian belongs to the group of inflectional synthetic languages, that is, languages in which word formation predominantly uses affixes that combine several grammatical meanings at once (for example, a single adjective ending can simultaneously indicate singular number, masculine gender, and nominative case), so the language lends itself to stemming algorithms. Russian has complex morphological variability of words, which is a source of stemming errors. As a solution to this problem, one can use, alongside classical stemming algorithms, lemmatization algorithms that reduce words to their initial base form.
Below are the most popular stemmer implementations for Russian, which are based on different principles and can process nonexistent (out-of-vocabulary) words.
Porter Stemmer
The main idea of the Porter stemmer is that there is a limited number of word-forming suffixes, and stemming is performed without any stem dictionaries: only the set of existing suffixes and manually defined rules are used.
The algorithm consists of five steps. At each step a word-forming suffix is cut off and the remainder is checked against the rules (for example, for Russian words the stem must contain at least one vowel). If the resulting word satisfies the rules, the next step is taken; if not, the algorithm chooses a different suffix to cut off. At the first step, the maximal inflectional suffix is removed; at the second, the letter “и”; at the third, the derivational suffix; at the fourth, superlative suffixes, “ь”, and one of a doubled “нн” [13].
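A minimal sketch of the vowel constraint mentioned above: a suffix is removed only if the remainder still contains a vowel. The suffix list here is a tiny illustrative fragment, not the actual five-step Porter rule set:

```python
RUSSIAN_VOWELS = set("аеёиоуыэюя")
# A tiny illustrative fragment of inflectional suffixes; the real algorithm
# has five ordered steps with many more rules.
SUFFIXES = ["ами", "ого", "ая", "ть", "и"]

def strip_if_valid(word: str) -> str:
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            rest = word[: -len(suffix)]
            if any(ch in RUSSIAN_VOWELS for ch in rest):
                return rest  # the remainder contains a vowel: accept
    return word  # no acceptable suffix found: leave the word unchanged

print(strip_if_valid("книгами"))  # -> книг
```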
This algorithm often trims the word more than necessary, which makes it hard to obtain the correct stem: for example, «кровать» → «кров» (here the truly invariant part is «кроват», but the stemmer chooses the longest morpheme to remove). Porter's stemmer also fails to handle all kinds of root alternations (for example, fleeting vowels).
Stemka
This stemming algorithm (analyzer) was developed by Andrey Kovalenko in 2002. It is based on a probabilistic model: words from a training text are split into pairs of “last two letters of the stem” + “suffix”; if such a pair is already present in the model, its weight is increased, otherwise it is added. The resulting data array is then ranked in descending order of weight, and entries whose probability is below 1/10000 are discarded. The result, a set of potential endings with conditions on the preceding characters, is inverted for convenient right-to-left scanning of word forms and represented as the transition table of a finite state machine. During parsing, the word is scanned against the constructed transition tables. A special rule was also added stating that an immutable stem must contain at least one vowel [14].
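A toy illustration of the pair-counting idea (not Stemka's actual code): count (“last two letters of the stem”, suffix) pairs from training segmentations and discard rare ones:

```python
from collections import Counter

# Training pairs: (stem, suffix) segmentations of words (illustrative data).
TRAINING = [("книг", "ами"), ("книг", "и"), ("стол", "ами"), ("стол", "а")]

counts = Counter((stem[-2:], suffix) for stem, suffix in TRAINING)
total = sum(counts.values())

# Keep only pairs above a probability threshold (Stemka uses 1/10000;
# a large value is used here only because the toy data set is tiny).
MIN_PROB = 0.2
model = {pair: n / total for pair, n in counts.items() if n / total >= MIN_PROB}
print(model)
# {('иг', 'ами'): 0.25, ('иг', 'и'): 0.25, ('ол', 'ами'): 0.25, ('ол', 'а'): 0.25}
```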
The analyzer is available as source code and may be used freely, provided the source is credited [15][16].
MyStem
The MyStem stemmer was developed by Ilya Segalovich in 1998 and is now the property of Yandex [17]. In the first step, a suffix tree is used to determine the possible boundaries between stem and suffix in the input word; then, for each candidate stem (starting with the longest), a binary search in the stem tree checks whether it is present in the dictionary, or finds the stems closest to it (with proximity measured by the length of the common “tail”). If the word is a dictionary word, the algorithm terminates; otherwise it proceeds to the next segmentation.
If the candidate stem does not coincide with any of the “closest” dictionary stems, the analyzed word with this stem is absent from the dictionary. Then, based on the candidate stem, the suffix, and the inflection model of the “closest” dictionary stem, a hypothetical inflection model for the word is generated. The hypothesis is remembered, and if it was already built earlier, its weight is increased. If the word is never found in the dictionary, the required length of the common stem ending is reduced by one, and the tree is searched for new hypotheses. When the length of the common “tail” reaches 2, the search stops and the hypotheses are ranked by productivity: if a hypothesis's weight is five or more times smaller than the largest weight, it is discarded. The output of the stemmer is the resulting set of hypotheses for a nonexistent word, or a single hypothesis for a dictionary word [18].
The stemmer may be used for commercial purposes, with the following exceptions: creating and distributing spam, search-engine optimization of websites, and developing products and services similar to those of Yandex [17]. Source code is not distributed [19]. To install, it is enough to download and unpack the archive [20].
Types of Errors
Stemming algorithms make two kinds of errors: overstemming and understemming. Overstemming is an error of the first kind, when inflected forms of different words are mistakenly assigned to the same lemma. Understemming is an error of the second kind, when morphological forms of one word are assigned to different lemmas. Stemming algorithms try to minimize both kinds of errors, although reducing one kind may increase the other [21].
Consider these kinds of errors in the operation of the Porter stemming algorithm. Overstemming: the algorithm maps the words “universal”, “university”, and “universe” to the stem “univers”; although the words are etymologically related, their modern meanings lie in different domains, so treating them as synonyms is incorrect. Understemming: Porter's algorithm maps words derived from the same lemma to different stems, and therefore to different lemmas: “alumnus” → “alumnu”, “alumni” → “alumni”, “alumna”/“alumnae” → “alumna” (these words retain Latin features in their morphology, so these near-synonyms are not merged by the stemmer).
Application
Stemming is used as an approximate method of grouping words with similar basic meanings. For example, a text that mentions “daffodils” is probably closely related to a text that mentions “daffodil” (without the “s”). But in some cases words with the same stem have idiomatic meanings that are almost unrelated: a user searching for documents containing “marketing” will also receive documents that mention “markets” but not “marketing” (which most likely does not match the user's information need).
Information Retrieval
Stemming is quite common in search engines. However, the effectiveness of stemming for English-language search was rather soon found to be quite limited, which led early information retrieval researchers to consider stemming inapplicable in general [22][23]. Instead of stemming, search engines can use an approach based on N-grams rather than stems. In addition, recent studies have shown large benefits of N-gram search for languages other than English [24][25].
Domain Analysis
When analyzing subject domains with the help of stemming, dictionaries of those domains are built [26].
Use in commercial products
Many commercial companies have used stemming since at least the 1980s and have developed algorithmic and lexical stemmers for many languages [27][28].
Snowball stemmers have been compared with commercial ones, with mixed results [29][30].
The Google search engine began using stemming in 2003 [31]. Previously, a search for “fish” would not have returned results containing “fishing”.
See also
- Root (linguistics)
- Morphology (linguistics)
- Word stem
- Lemmatization
- Token (linguistics)
- Inflection
- Word formation
- Natural language processing
- Text analysis
- Computational linguistics
Notes
- ↑ Lovins, 1968, pp. 22–31.
- ↑ Y-stemmer, Viatcheslav Yatsko.
- ↑ Porter et al, 1980, pp. 130–137.
- ↑ Plisson et al, 2004, pp. 1–2.
- ↑ Plisson et al, 2004, pp. 2–3.
- ↑ Smirnov, 2008, p. 3.
- ↑ Smirnov, 2008, pp. 4–5.
- ↑ Ljiljana et al, 2007.
- ↑ Jacques, 2006.
- ↑ Popovič et al, 1992, pp. 384–390.
- ↑ Anna Tordai et al, 2005.
- ↑ Viera et al, 2007, p. 26.
- ↑ Russian stemming algorithm.
- ↑ Gubin et al., 2006, pp. 2–3.
- ↑ NLPub: Stemka.
- ↑ Official site of the Stemka analyzer.
- ↑ 1 2 Mystem License Agreement.
- ↑ Segalovich, 2003, pp. 4–5.
- ↑ NLPub: Mystem.
- ↑ Official site of Mystem.
- ↑ Paice, 1994.
- ↑ Baeza-Yates et al, 1999.
- ↑ Manning et al., 2011, pp. 53–56.
- ↑ Kamps et al, 2004, pp. 152–165.
- ↑ Airio et al, 2006, pp. 249–271.
- ↑ Frakes et al, 1998, pp. 129–141.
- ↑ Language Extension Packs.
- ↑ Building Multilingual Solutions by using Sharepoint Products and Technologies.
- ↑ Stephen Tomlinson, 2003.
- ↑ Stephen Tomlinson, 2004.
- ↑ Google Starts Auto Stemming Searches.
Literature
References used
- Lovins, Julie Beth. Development of a Stemming Algorithm // Mechanical Translation and Computational Linguistics. — 1968. — Vol. 11.
- Viatcheslav Yatsko. Y-stemmer. Retrieved January 18, 2014.
- Porter, Martin F. An Algorithm for Suffix Stripping // Program: electronic library and information systems. — 1980. — Vol. 14, No. 3. Archived May 28, 2007.
- Joel Plisson, Nada Lavrac, Dunja Mladenić. A Rule based Approach to Word Lemmatization // In the Proceedings of the SiKDD at multiconference IS. — Slovenia, 2004. Archived October 14, 2014.
- Ilia Smirnov. Overview of stemming algorithms // Mechanical Translation. — 2008. Archived March 4, 2011.
- Jongejan B., Dalianis H. Automatic Training of Lemmatization Rules that Handle Morphological Changes in pre-, in- and Suffixes Alike // In the Proceedings of ACL-2009, Joint conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing. — Singapore, 2009. — P. 145–153.
- Ljiljana Dolamic, Jacques Savoy. Stemming Approaches for East European Languages // CLEF. — 2007.
- Jacques Savoy. Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages // In the Proceedings of the 2006 ACM symposium on Applied computing, SAC. — 2006. — ISBN 1-59593-108-2.
- Mirko Popovič, Peter Willett. The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data // Journal of the American Society for Information Science. — 1992. — Vol. 43, No. 5.
- Anna Tordai, Maarten de Rijke. Stemming in Hungarian // CLEF. — 2005.
- Viera A. F. G., Virgil J. Uma revisão dos algoritmos de radicalização em língua portuguesa // Information Research. — 2007. — Vol. 12, No. 3. — P. 315.
- Chris D. Paice. An evaluation method for stemming algorithms // In the Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. — 1994. — P. 42–50. — ISBN 0-387-19889-X.
- Baeza-Yates R., Ribeiro-Neto B. Modern Information Retrieval. — Addison-Wesley, 1999. — ISBN 0-201-39829-X.
- Manning C., Raghavan P., Schütze H. Introduction to Information Retrieval (Russian translation). — Williams, 2011. — 512 p. — ISBN 978-5-8459-1623-5.
- Jaap Kamps, Christof Monz, Maarten de Rijke, Börkur Sigurbjörnsson. Language-Dependent and Language-Independent Approaches to Cross-Lingual Text Retrieval / Peters C., Gonzalo J., Braschler M., Kluck M. — Springer Verlag, 2004.
- Airio, Eija. Word Normalization and Decompounding in Mono- and Bilingual IR // Information Retrieval. — 2006. — Vol. 9.
- Frakes W., Prieto-Diaz R., Fox C. DARE: Domain Analysis and Reuse Environment // Annals of Software Engineering. — 1998. — Vol. 5. Archived June 19, 2010.
- dtSearch, Language Extension Packs (dead link). Retrieved January 18, 2014. Archived September 14, 2011.
- Stephen Tomlinson. Lexical and Algorithmic Stemming Compared for 9 European Languages with Hummingbird SearchServer // CLEF. — 2003.
- Stephen Tomlinson. Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer // CLEF. — 2004.
- Greg R. Notess. Google Starts Auto Stemming Searches (November 28, 2003). Retrieved January 18, 2014.
- Russian stemming algorithm. Retrieved January 26, 2014.
- M. V. Gubin, A. B. Morozov. The influence of morphological analysis on the quality of information retrieval (in Russian). — 2006.
- Ilya Segalovich. A fast morphological algorithm with unknown word guessing induced by a dictionary for a web search. — 2003.
Further reading
- Dawson, JL (1974); Suffix Removal for Word Conflation , Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33-46
- Frakes, WB (1984); Term Conflation for Information Retrieval , Cambridge University Press
- Frakes, WB & Fox, CJ (2003); Strength and Similarity of Affix Removal Stemming Algorithms , SIGIR Forum, 37: 26-30
- Frakes, WB (1992); Stemming algorithms, Information retrieval: data structures and algorithms , Upper Saddle River, NJ: Prentice-Hall, Inc.
- Hafer, MA & Weiss, SF (1974); Word segmentation by letter successor varieties , Information Processing & Management 10 (11/12), 371—386
- Harman, D. (1991); How Effective is Suffixing? , Journal of the American Society for Information Science 42 (1), 7-15
- Hull, DA (1996); Stemming Algorithms — A Case Study for Detailed Evaluation , JASIS, 47(1): 70-84
- Hull, DA & Grefenstette, G. (1996); A Detailed Analysis of English Stemming Algorithms , Xerox Technical Report
- Kraaij, W. & Pohlmann, R. (1996); Viewing Stemming as Recall Enhancement , in Frei, H.-P.; Harman, D.; Schauble, P.; and Wilkinson, R. (eds.); Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18-22 , pp. 40-48
- Krovetz, R. (1993); Viewing Morphology as an Inference Process , in Proceedings of ACM-SIGIR93 , pp. 191—203
- Lennon, M.; Pierce, DS; Tarry, BD; & Willett, P. (1981); An Evaluation of some Conflation Algorithms for Information Retrieval , Journal of Information Science, 3: 177—183
- Lovins, JB (1968); Development of a Stemming Algorithm , Mechanical Translation and Computational Linguistics, 11, 22—31
- Jenkins, Marie-Claire; and Smith, Dan (2005); Conservative Stemming for Search and Indexing
- Paice, CD (1990); Another Stemmer , SIGIR Forum, 24: 56-61
- Popovič, Mirko; and Willett, Peter (1992); The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data , Journal of the American Society for Information Science, Volume 43, Issue 5 (June), pp. 384—390
- Savoy, J. (1993); Stemming of French Words Based on Grammatical Categories Journal of the American Society for Information Science, 44(1), 1-9
- Ulmschneider, John E.; & Doszkocs, Tamas (1983); A Practical Stemming Algorithm for Online Search Assistance (dead link), Online Review, 7(4), 301—318
- Xu, J.; & Croft, WB (1998); Corpus-Based Stemming Using Coocurrence of Word Variants , ACM Transactions on Information Systems, 16(1), 61-81
Links
- NLPub: Stemka (a probabilistic morphological analyzer for Russian, developed by Andrey Kovalenko). Retrieved January 27, 2014.
- A probabilistic morphological analyzer for Russian and Ukrainian. Retrieved January 27, 2014.
- NLPub: Mystem. Retrieved January 27, 2014.
- About the mystem program. Retrieved January 27, 2014.
- License agreement for the “MyStem” program (October 27, 2011). Retrieved May 12, 2014.
- Apache OpenNLP uses the Porter and Snowball stemmers
- SMILE Stemmer — a free online service using the Porter and Paice/Husk Lancaster stemmers (Java API)
- Themis — an open-source framework using an implementation of the Porter stemmer (PostgreSQL, Java API)
- Snowball — free stemming algorithms for many languages, including source code, as well as stemmers for five Romance languages
- Snowball on C# — an implementation of the Snowball stemmers in C# (14 languages)
- Language wrappers — a Python extension for the Snowball API
- Ruby-Stemmer — a Ruby extension for the Snowball API
- PECL — a PHP extension for the Snowball API
- Oleander Porter's algorithm — a stemming library for C++ distributed under the BSD license
- Unofficial home page of the Lovins stemming algorithm — open-source stemmers for multiple languages
- Official home page of the Lancaster stemming algorithm — Lancaster University, UK
- Overview of stemming algorithms
- PTStemmer — Java/Python/.NET stemmers for Portuguese
- jsSnowball — an open-source JavaScript implementation of the Snowball stemming algorithms for many languages
- Snowball Stemmer — an implementation in Java
- hindi_stemmer — an open-source stemmer for Hindi
- czech_stemmer — an open-source stemmer for Czech