Information retrieval

History of Information Retrieval

  • Vannevar Bush: "As We May Think"

    "In view of the extent of present-day interests, the problem is not so much that we publish too much, but rather that publication has far exceeded our present ability to make real use of it (...) Professionally, our methods of transmitting and reviewing the results of scientific research are generations old and by now totally inadequate for their purpose ...
  • Period I: start of the use of the computer in IR

  • Calvin Mooers introduced the term "information retrieval"

  • H. P. Luhn: proposed to use words as units of indexing

    He proposed using words as indexing units for documents and measuring the overlap of words as a retrieval criterion.
  • Cranfield Institute of Technology: marked the beginning of IR as an empirical discipline

    At the Cranfield Institute of Technology and other associated institutions, tests began that marked the beginning of information retrieval as an empirical discipline. These tests strongly influenced the evolution of the field, and with them an evaluation methodology was developed that IR systems still use today.
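The Cranfield-style evaluation methodology rests on relevance judgments, from which set-based measures such as precision and recall are computed. A minimal sketch (a toy illustration, not the original Cranfield procedure):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over relevance judgments.

    retrieved: document ids returned by the system.
    relevant:  document ids judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    found = retrieved & relevant
    precision = len(found) / len(retrieved) if retrieved else 0.0
    recall = len(found) / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, a system that returns two documents of which one is relevant, out of two relevant documents in total, scores 0.5 on both measures.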
  • Period II: the 1960s

  • Gerard Salton: VSM, TF, cosine similarity

    The group (at Harvard and Cornell universities), led by Salton, produced numerous technical reports establishing ideas and concepts that remain important research areas today: the formalization of algorithms to rank documents with respect to a query, an approach in which documents and queries are represented as vectors in an n-dimensional space, and the measurement of the similarity between a document vector and the query vector as the cosine of the angle between the two vectors.
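The cosine measure described above can be sketched in a few lines over raw term-count vectors (a toy illustration with no term weighting, not Salton's SMART system):

```python
import math
from collections import Counter

def cosine_similarity(doc, query):
    """Cosine of the angle between two term-count vectors."""
    d = Counter(doc.lower().split())
    q = Counter(query.lower().split())
    dot = sum(d[t] * q[t] for t in q)          # inner product
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0
```

Identical texts score 1.0 (angle of zero), and texts with no terms in common score 0.0 (orthogonal vectors).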
  • Period III: the 1970s

    One of the key developments of this period was Luhn's term-frequency (TF) weighting (based on the occurrence of words within a document), complemented by the work of Spärck Jones on the occurrence of words across the documents of a collection. Likewise, Salton synthesized the results of his group's work on vectors to produce the vector space model.
  • Karen Spärck Jones: TF-IDF

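Combining Luhn's term frequency with Spärck Jones's inverse document frequency gives the classic TF-IDF weight. A minimal sketch using raw TF and a logarithmic IDF (many weighting variants exist):

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF weight of a term in one document of a corpus.

    corpus: list of token lists. A term occurring in every
    document gets IDF 0, i.e. it carries no discriminating power.
    """
    tf = Counter(doc_tokens)[term]
    df = sum(1 for d in corpus if term in d)   # document frequency
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

A term that appears in only one document of the collection is weighted higher than one that appears everywhere.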
  • Robertson & Spärck Jones: probabilistic model

    An alternative means of modeling IR systems extended the idea of Maron et al. [86] of using probability theory. Robertson defined the probability ranking principle, which determines how best to rank documents based on probabilistic measures with respect to the defined evaluation measures.
  • Martin F Porter: stemming

    New algorithms were created for stemming, the process of matching words to their lexical variants. Although such algorithms had been known since the 1960s, they improved substantially with the contributions of Porter and other authors, and they are still in use today.
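The idea behind suffix-stripping stemmers can be illustrated with a deliberately simplified sketch. This is not Porter's actual algorithm, which applies several rule phases guarded by measure-based conditions; the suffix list and minimum-stem length here are arbitrary choices for illustration:

```python
def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at
    least three letters (a toy illustration, not Porter's rules)."""
    for suffix in ("ational", "ization", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

So "connected" and "cats" reduce to "connect" and "cat", letting an index match lexical variants of the same word.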
  • Period IV: the 1980s to the mid-1990s

  • Deerwester et al.: LSI

  • Text Retrieval Conference (TREC)

    An initiative of Voorhees and Harman: an annual exercise in which numerous international research groups collaborate to build test collections larger than those that existed before. With the large text collections available through TREC, many old techniques were modified and new ones were developed, and continue to be developed, for effective retrieval.
  • Robertson et al.: BM25
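BM25 combines an IDF component with term-frequency saturation and document-length normalization. A minimal sketch of the Okapi BM25 score with the usual free parameters k1 and b (one common IDF variant among several):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.

    corpus: list of token lists, used for document frequencies
    and the average document length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    freqs = Counter(doc_tokens)
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in corpus if t in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        f = freqs[t]
        # TF saturates as f grows; b scales length normalization
        norm = f + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score
```

A document containing a query term outscores one that does not, and repeated occurrences yield diminishing returns rather than linear growth.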

  • Period V: the mid-1990s to the present

  • Brin & Page: PageRank

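PageRank scores pages by how much rank flows to them from the pages that link to them, computed by power iteration with a damping factor. A minimal sketch over an adjacency dict (a toy illustration, not Google's production computation):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outlinks in links.items():
            if outlinks:
                share = damping * rank[p] / len(outlinks)
                for q in outlinks:      # rank flows along out-links
                    new[q] += share
            else:                       # dangling page: spread uniformly
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

The ranks form a probability distribution (they sum to 1), and pages that receive links from many or highly ranked pages end up with higher scores.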
  • Ponte & Croft: language model
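Ponte and Croft's language-modeling approach ranks documents by the probability that a document's language model generates the query. A minimal sketch using unigram models with Jelinek-Mercer smoothing against the collection model (the smoothing weight lam is an arbitrary choice here for illustration):

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_tokens, corpus, lam=0.5):
    """Log-probability that a document's unigram language model
    generates the query, smoothed with the collection model."""
    doc_counts = Counter(doc_tokens)
    coll_counts = Counter(t for d in corpus for t in d)
    coll_len = sum(coll_counts.values())
    score = 0.0
    for t in query_terms:
        p_doc = doc_counts[t] / len(doc_tokens)   # document model
        p_coll = coll_counts[t] / coll_len        # collection model
        p = lam * p_doc + (1 - lam) * p_coll      # smoothed mixture
        if p == 0:
            return float("-inf")
        score += math.log(p)
    return score
```

Smoothing keeps a document from scoring zero just because it misses one query term, while documents actually containing the term still rank higher.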

  • Jon M. Kleinberg: HITS

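Kleinberg's HITS assigns each page two mutually reinforcing scores: a good authority is pointed to by good hubs, and a good hub points to good authorities. A minimal sketch by iterated updates with normalization (a toy illustration on an adjacency dict):

```python
def hits(links, iterations=50):
    """Hub and authority scores on a link graph.

    links: dict mapping each page to the list of pages it links to.
    """
    pages = set(links) | {q for out in links.values() for q in out}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority: sum of hub scores of pages linking to it
        auth = {p: sum(hub[q] for q in links if p in links[q])
                for p in pages}
        # hub: sum of authority scores of pages it links to
        hub = {p: sum(auth[q] for q in links.get(p, []))
               for p in pages}
        # normalize so the scores stay bounded
        na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / na for p, v in auth.items()}
        hub = {p: v / nh for p, v in hub.items()}
    return hub, auth
```

On a graph where two pages both link to a third, the third emerges as the authority and the other two as hubs.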