Information retrieval is the process of searching for unstructured documentary information that satisfies an information need [1], as well as the science of such searching.
History
The term "information retrieval" was coined by Calvin Mooers in 1948 in his dissertation and has been used in published literature since 1950.
At first, automated information retrieval (IR) systems were used only to search for scientific information and literature. Later, many universities and public libraries began to use IR systems to provide access to books, magazines, and other documents. IR systems became widespread with the advent of the Internet and the development of the World Wide Web. Among Russian-speaking users, the most popular search engines are Yandex and Google [2].
Information Retrieval as a Process
Information search is the process of identifying, in a set of documents (texts), all those that are devoted to a specified topic (subject), satisfy a predetermined search condition (query), or contain the facts, information, or data needed (that is, relevant to the information need).
The search process includes a sequence of operations aimed at collecting, processing and providing information.
In general, the search for information consists of four stages:
- definition (clarification) of the information need and formulation of the information query;
- identification of all potential holders of information collections (sources);
- extracting information from identified information arrays;
- familiarization with the information received and evaluation of search results.
Search Types
Full-text search - search over the entire contents of a document. Any Internet search engine is an example of full-text search, for instance www.yandex.ru or www.google.com. Full-text search typically relies on pre-built indexes to speed up the search. The most common index structure for full-text search is the inverted index.
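The idea of an inverted index can be illustrated with a minimal sketch. It assumes a toy in-memory collection, a naive tokenizer, and boolean AND matching; the names tokenize, build_inverted_index, and search, as well as the sample documents, are hypothetical and not taken from any particular search engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical three-document collection, used only for illustration.
documents = {
    1: "Full-text search scans the whole document",
    2: "An inverted index speeds up full-text search",
    3: "Metadata search uses document attributes",
}
index = build_inverted_index(documents)
print(sorted(search(index, "full-text search")))  # prints [1, 2]
```

The index is built once, so answering a query only requires intersecting the posting sets of its terms rather than rescanning every document.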
Search by metadata - search by certain document attributes supported by the system, such as the document's name, creation date, size, or author. An example of metadata search is the search dialog of a file system (for example, in MS Windows).
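A metadata search can be sketched as a filter over attribute records. The example below is only an illustration: the DocumentRecord fields, the search_by_metadata helper, and the sample records are assumptions, and matching is restricted to exact equality.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocumentRecord:
    """Metadata kept for one document (the fields are illustrative)."""
    name: str
    author: str
    created: date
    size_bytes: int


def search_by_metadata(records, **criteria):
    """Return the records whose attributes equal every given criterion."""
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Hypothetical records, used only for illustration.
records = [
    DocumentRecord("report.txt", "Ivanov", date(2020, 1, 15), 2048),
    DocumentRecord("notes.txt", "Petrov", date(2021, 6, 1), 512),
    DocumentRecord("summary.txt", "Ivanov", date(2021, 6, 1), 1024),
]
print([r.name for r in search_by_metadata(records, author="Ivanov")])
```

Unlike full-text search, the document contents are never inspected here; only the stored attributes are compared.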
Image search - search by image content. The search engine recognizes the content of a photo (either uploaded by the user or specified by its URL) and returns similar images in the search results. Search engines such as Polar Rose and Picollator work this way.
Search Methods
Address Search
The process of searching for documents by purely formal attributes specified in the query.
The following conditions are necessary for implementation:
- Each document has an exact address.
- Documents are arranged in a strict order in the storage device or storage system.
Document addresses may be the addresses of web servers and web pages, elements of bibliographic records, or the storage locations of documents in a repository.
Semantic Search
The process of finding documents by their content.
Conditions:
- Translation of the content of documents and queries from natural language into an information retrieval language, and compilation of the search images of the document and of the query.
- Preparation of a search description that specifies additional search conditions.
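The first condition can be sketched in code under strong simplifying assumptions: the information retrieval language is a tiny set of descriptors, the translation is a word-to-descriptor lookup, and a document matches when its search image contains every descriptor of the query's search image. The THESAURUS table and the function names are hypothetical.

```python
import re

# Hypothetical thesaurus: natural-language words mapped to descriptors
# of a toy information retrieval language.
THESAURUS = {
    "car": "VEHICLE", "cars": "VEHICLE", "automobile": "VEHICLE",
    "engine": "ENGINE", "motor": "ENGINE",
    "repair": "MAINTENANCE", "fixing": "MAINTENANCE",
}


def search_image(text):
    """Translate text into its search image: the set of descriptors found."""
    words = re.findall(r"\w+", text.lower())
    return {THESAURUS[word] for word in words if word in THESAURUS}


def semantic_search(documents, query):
    """Return ids of documents whose search image covers the query's image."""
    query_image = search_image(query)
    if not query_image:
        return []
    return [
        doc_id
        for doc_id, text in documents.items()
        if query_image <= search_image(text)
    ]


# Hypothetical collection, used only for illustration.
documents = {
    "a": "Fixing the motor of an old automobile",
    "b": "Cars on the highway",
    "c": "Library opening hours",
}
print(semantic_search(documents, "car engine repair"))  # prints ['a']
```

Document "a" shares no words with the query, yet it matches because both are translated into the same descriptors; this is what distinguishes matching by content from matching by form.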
The fundamental difference between address and semantic searches is that in address search, a document is considered as an object in terms of form, and in semantic search in terms of content.
A semantic search yields a set of documents without indicating their addresses.
This is the fundamental difference between catalogs and card indexes.
Library - a collection of bibliographic records without indicating addresses.
Documentary Search
The process of searching the repository of an information retrieval system for primary documents, or its database for secondary documents, that match the user's query.
Two types of document search:
- Library, aimed at finding primary documents.
- Bibliographic, aimed at finding information about documents presented in the form of bibliographic records.
Factographic Search
The process of finding facts relevant to an information query.
Factographic data include information extracted from documents, both primary and secondary, and obtained directly from the sources of their occurrence.
There are two types:
- Documentary-factographic: searching documents for text fragments that contain facts.
- Factological (description of facts): creating new factual descriptions during the search by logically processing the factographic information found.
Information Search as a Science
Information retrieval is a large interdisciplinary field of science at the crossroads of cognitive psychology, computer science, information design, linguistics, semiotics, and library science.
Information search is the process of identifying records in an information array that satisfy a predefined search condition or query.
Information retrieval (IR) covers searching for information within documents, searching for the documents themselves, extracting metadata from documents, and searching for text, images, video, and sound in local relational databases and in hypertext databases such as the Internet and local intranet systems.
There is some confusion associated with the concepts of data retrieval, document retrieval, information retrieval, and text search. Nevertheless, each of these areas of research has its own methods, practical experience and literature.
Currently, information retrieval is a rapidly developing field of science whose popularity is driven by the exponential growth of information volumes, in particular on the Internet. An extensive literature and many conferences are devoted to it. One of the best known is the Text REtrieval Conference (TREC), organized in 1992 by the US Department of Defense in collaboration with the National Institute of Standards and Technology (NIST) to consolidate the research community and develop methods for assessing the quality of information retrieval.
Query and Query Object
In discussing information retrieval (IR) systems, the terms query and query object are used.
A query is a formalized expression of a system user's information need. A search query language is used to express it, and its syntax varies from system to system. In addition to a special query language, modern search engines accept queries in natural language.
A query object is an information entity stored in the database of an automated search system. Although the most common query object is a text document, there are no fundamental restrictions: it is also possible to search for images, music, and other multimedia information. The process of entering search objects into an IR system is called indexing. The system does not always store an exact copy of an object; often a surrogate is stored instead.
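Storing a substitute rather than the object itself can be sketched as follows. This is only an illustration under stated assumptions: the Surrogate record, its fields (a term set and a short leading snippet), and the helper names are hypothetical, and real systems choose their own surrogate contents.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Surrogate:
    """What the system keeps in place of the full object (illustrative)."""
    object_id: str
    terms: frozenset
    snippet: str


def index_object(object_id, text, snippet_length=40):
    """Build a surrogate: the object's terms plus a short leading snippet."""
    terms = frozenset(re.findall(r"\w+", text.lower()))
    return Surrogate(object_id, terms, text[:snippet_length])


def matches(surrogate, query):
    """True if every query term occurs among the surrogate's terms."""
    return set(re.findall(r"\w+", query.lower())) <= surrogate.terms


# Indexing: only the surrogate is stored, not the original text.
store = {}
text = "Information retrieval systems index documents before searching them."
store["doc-1"] = index_object("doc-1", text)
print(store["doc-1"].snippet)
print(matches(store["doc-1"], "index documents"))
```

Queries are answered from the surrogate alone, which is why the original object does not need to be kept in the search system's database.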
Information Retrieval Tasks
The central task of information retrieval (IR) is to help users satisfy their information needs. Since it is technically difficult to describe a user's information need directly, it is formulated as a query: a set of keywords that characterizes what the user is looking for.
The classic IR task, with which the development of this field began, is the search for documents that satisfy a query within a static collection of documents. The list of IR tasks is constantly expanding, however, and now includes:
- Modeling issues;
- Classification of documents;
- Document filtering;
- Clustering of documents;
- Designing search engine architectures and user interfaces;
- Information extraction, in particular annotating and summarizing documents;
- Query languages, etc.
In addition, information retrieval systems face a number of natural language processing tasks, including morphological analysis, resolution of lexical ambiguity, and so on.
Performance Evaluation
There are many ways to evaluate how well the documents found by an information retrieval system match the query. Unfortunately, the degree to which a document matches a query, in other words its relevance, is a subjective concept that depends on the person evaluating the results.
Precision
It is defined as the ratio of the number of relevant documents found by the system to the total number of documents found:
$$\mathrm{Precision} = \frac{|D_{rel} \cap D_{retr}|}{|D_{retr}|},$$
where $D_{rel}$ is the set of relevant documents in the database and $D_{retr}$ is the set of documents found by the system.
Recall (completeness)
The ratio of the number of relevant documents found to the total number of relevant documents in the database:
$$\mathrm{Recall} = \frac{|D_{rel} \cap D_{retr}|}{|D_{rel}|},$$
where $D_{rel}$ is the set of relevant documents in the database and $D_{retr}$ is the set of documents found by the system.
Fall-out
Fall-out characterizes the probability of finding an irrelevant resource and is defined as the ratio of the number of irrelevant documents found to the total number of irrelevant documents in the database:
$$\mathrm{Fall\text{-}out} = \frac{|D_{nrel} \cap D_{retr}|}{|D_{nrel}|},$$
where $D_{nrel}$ is the set of irrelevant documents in the database and $D_{retr}$ is the set of documents found by the system.
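The three ratios defined above can be computed directly on sets of document identifiers. The sketch below is a minimal illustration: the function names and the ten-document collection are hypothetical, and returning 0.0 when a denominator is empty is a convention assumed here, not part of the definitions.

```python
def precision(relevant, retrieved):
    """Share of retrieved documents that are relevant (0.0 if none retrieved)."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0


def recall(relevant, retrieved):
    """Share of relevant documents that were retrieved (0.0 if none relevant)."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0


def fall_out(non_relevant, retrieved):
    """Share of irrelevant documents that were retrieved (0.0 if none exist)."""
    return len(non_relevant & retrieved) / len(non_relevant) if non_relevant else 0.0


# Hypothetical collection of ten documents identified by the numbers 1..10.
collection = set(range(1, 11))
relevant = {1, 2, 3, 4}
non_relevant = collection - relevant
retrieved = {3, 4, 5, 6, 7}

print(precision(relevant, retrieved))     # 2 of 5 retrieved are relevant: 0.4
print(recall(relevant, retrieved))        # 2 of 4 relevant were retrieved: 0.5
print(fall_out(non_relevant, retrieved))  # 3 of 6 irrelevant were retrieved: 0.5
```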
F-measure (van Rijsbergen measure)
Sometimes it is useful to combine precision and recall into a single averaged value. The arithmetic mean is unsuitable for this purpose: for example, a search system can achieve a recall of one with a precision close to zero simply by returning all documents, and the arithmetic mean of precision and recall will then be no less than 1/2. The harmonic mean does not have this drawback, because when the averaged values differ greatly it approaches the smaller of them.
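Therefore, a good measure for jointly assessing precision and recall is the F-measure, which is defined as the weighted harmonic mean of precision P and recall R:
$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}}, \quad \alpha \in [0, 1].$$
The F-measure is usually written as
$$F_\beta = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \beta^2 = \frac{1 - \alpha}{\alpha}, \quad \beta^2 \in [0, \infty].$$
For $\alpha = 1/2$, or equivalently $\beta = 1$, the F-measure gives equal weight to precision and recall and is called the balanced F-measure or $F_1$-measure (the subscript customarily indicates the value of $\beta$); the expression for it simplifies to
$$F_1 = \frac{2 P R}{P + R}.$$
Using the balanced F-measure is not mandatory: for $0 < \beta < 1$ precision is favored, whereas for $\beta > 1$ recall gains more weight.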
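The behaviour of the F-measure can be checked numerically. The sketch below assumes the common form F = (β² + 1)PR / (β²P + R) of the weighted harmonic mean; the function name f_measure, the default beta of 1, and the convention of returning 0.0 when the denominator is zero are assumptions made for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 gives more weight to recall, 0 < beta < 1 to precision.
    Returns 0.0 when the denominator is zero (an assumed convention).
    """
    if beta < 0:
        raise ValueError("beta must be non-negative")
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# A system that returns every document: recall is 1, precision is near 0.
p, r = 0.01, 1.0
print((p + r) / 2)      # the arithmetic mean stays above 1/2
print(f_measure(p, r))  # the balanced F-measure stays close to the minimum

# Changing beta shifts the weight between precision and recall.
print(f_measure(0.2, 0.8, beta=0.5))
print(f_measure(0.2, 0.8, beta=1.0))
print(f_measure(0.2, 0.8, beta=2.0))
```

With precision 0.2 and recall 0.8, the value grows as beta increases, because more weight is placed on the larger of the two quantities, recall.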
See also
- Digital Libraries
- Full Text Search
- Search engines
- Russian seminar on the assessment of information retrieval methods (ROMIP)
Notes
- [1] Manning et al., 2011, p. 23.
- [2] Transitions - ANALYZETHIS.RU
Literature
- Baeza-Yates R., Ribeiro-Neto B. Modern Information Retrieval. - Addison-Wesley, 1999. - ISBN 0-201-39829-X .
- Manning C., Raghavan P., Schütze H. Introduction to Information Retrieval. - Cambridge University Press, 2008. - ISBN 0-521-86571-9.
- Translation: Manning K., Raghavan P., Schütze H. Introduction to the Information Search. - Williams, 2011. - ISBN 978-5-8459-1623-5.
- Lande D.V., Snarsky A.A., Bezsudnov I.V. Internet: Navigation in complex networks: models and algorithms. - M.: Librocom (Editorial URSS), 2009. - 264 p. - ISBN 978-5-397-00497-8.
Links
- "Information Search" community on LiveJournal
- Yuri Lifshits. Lecture Course "Algorithms for the Internet"
- Kuralenok I. Ye., Nekrestyanov I. S. Review "Assessment of text search systems"