This entry is out of date, and will not be updated, June 2017
Information retrieval (IR) refers to searching for information based on the defined information needs of users. It has been defined as "the science of searching for information within documents, for metadata which describe documents or within databases whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets". The term covers searching for text, sound, images, video and data. Data retrieval, document retrieval, information retrieval and text retrieval are often confused, although each has its own literature, theory and praxis. An IR system consists of a collection of data or information, one or more indexes, a query interface with a search system, and a results interface.
The term information retrieval was coined by Calvin Mooers in 1948. Since then, the problem of information storage and retrieval has become more important as we move into the information age. As the amount of information has grown, accurate and speedy access to it has become more and more difficult. One effect is that relevant information is hard to locate, which in turn leads to duplication of intellectual work. In the digital age, considerable thought has been given to creating intelligent retrieval systems. In libraries, which face information storage and retrieval problems of various kinds, tasks such as cataloguing and general administration have been handled successfully by computers. However, largely because information on the web is fragmented, the problem of effective retrieval remains.
The principle of information retrieval is straightforward. Given a collection of documents (a database) and a searcher who formulates a question (a request or query), what is needed is a set of documents, and a good retrieval system should produce it. In theory, searchers could read every document in the database, retaining the relevant ones and discarding the others. In a sense, this constitutes 'perfect' retrieval, but it is impracticable: a user either does not have the time or does not wish to spend the time reading the entire document collection, and it may be physically impossible to do so.
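The principle can be illustrated with a minimal sketch in Python. It assumes a tiny, made-up collection of three documents, a simple inverted index (each term mapped to the documents that contain it) and Boolean AND matching. The names build_index and search, and the sample texts, are hypothetical and chosen only for illustration; real IR systems add ranking, stemming, stop words and much more.

```python
from collections import defaultdict

# Hypothetical three-document collection; IDs and texts are illustrative only.
documents = {
    1: "information retrieval in medical libraries",
    2: "cataloguing books in public libraries",
    3: "evaluating retrieval systems with precision and recall",
}


def build_index(docs):
    """Map each lowercased term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (Boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Start from the postings of the first term, then intersect with the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(documents)
print(sorted(search(index, "retrieval")))            # documents 1 and 3
print(sorted(search(index, "libraries retrieval")))  # document 1 only
```

The index lets the system answer a query without the searcher reading the whole collection, which is the practical alternative to the 'perfect' retrieval described above.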
IR is a broad interdisciplinary field drawing upon cognitive psychology, information architecture, information design, human information behaviour, linguistics, semiotics, information science, computer science, librarianship and statistics. Automated information retrieval systems were originally envisioned as a way to manage scientific literature. Many universities and public libraries now use IR systems to provide access to books, journals and other documents.
IR systems are often described in terms of 'objects' and 'queries'. Queries are formal statements of information needs put to an IR system. An object is an entity that stores information in a database; user queries are matched against the objects stored there. In this context, documents are the data objects. Often documents are not stored directly in the IR system but are represented by surrogate records.
In 1992, the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. Its aim was to support research within the information retrieval community by supplying the infrastructure needed to evaluate text retrieval methodologies.
There are various ways to measure how well the retrieved information matches the searcher's intentions. The formulas for precision, recall and fall-out are translated from the German Wikipedia article "Recall und Precision".
What is precision in information retrieval? It is the proportion of retrieved documents that are relevant, out of all documents retrieved. In MEDLINE, there are a number of techniques that retrievalists can draw upon to improve precision, such as using the focus function, searching for keywords in the titles of articles, and using wildcards and truncation. In binary classification, precision is analogous to positive predictive value. Precision can also be evaluated at a given cut-off rank, considering only the top-ranked results instead of all retrieved documents. Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science, technology and medicine.
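As a minimal sketch, precision and precision at a cut-off rank can be computed in Python as follows. The function names and the document IDs are hypothetical, and returning 0.0 for an empty result set is a convention assumed here, since precision is strictly undefined in that case.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # assumed convention: undefined for an empty result set
    return len(retrieved & set(relevant)) / len(retrieved)


def precision_at_k(ranked, relevant, k):
    """Precision computed over only the top k results of a ranked list."""
    return precision(ranked[:k], relevant)


# Hypothetical ranked result list and set of relevant documents.
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

print(precision(ranked, relevant))          # 2 of the 5 retrieved are relevant
print(precision_at_k(ranked, relevant, 2))  # 1 of the top 2 is relevant
```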
What is recall? Recall is the proportion of relevant documents that are retrieved, out of all relevant documents available. In binary classification, recall is also referred to as sensitivity.
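Recall can be sketched in the same way. The function name and sample data below are hypothetical, and returning 0.0 when no relevant documents exist is an assumed convention.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumed convention: undefined when nothing is relevant
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical result list and set of relevant documents.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

print(recall(retrieved, relevant))  # 2 of the 3 relevant documents were found
```

A search that returns the whole collection always reaches a recall of 1, which is why recall is normally reported alongside precision.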
Fall-Out or false hits
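Fall-out is the proportion of non-relevant documents that are retrieved, out of all non-relevant documents available. It can be read as the probability that an irrelevant document is retrieved. The irrelevant citations that turn up among the retrieved documents are referred to as false hits or 'false drops'.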
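A minimal Python sketch of fall-out follows, taking it in the usual sense of the share of all non-relevant documents in the collection that end up among the results. Unlike precision and recall, it needs to know the whole collection. The function name, the ten-document collection and the 0.0 convention for a collection with no non-relevant documents are assumptions made for illustration.

```python
def fall_out(retrieved, relevant, collection):
    """Fraction of all non-relevant documents that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0  # assumed convention: undefined if every document is relevant
    return len(set(retrieved) & non_relevant) / len(non_relevant)


# Hypothetical collection of ten documents, d1 to d10.
collection = {f"d{i}" for i in range(1, 11)}
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

# Seven documents are non-relevant; three of them (d4, d7, d9) were retrieved.
print(fall_out(retrieved, relevant, collection))
```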