
1. Boolean Model
2. Vector Space Model
3. Probabilistic Model
4. Semantic Model

**Retrieval Models**

In this section we briefly describe the important models of IR. These
are the three main statistical models—Boolean, vector space, and
probabilistic—and the semantic model.

**1. Boolean Model**

In this model, documents are represented as a set of *terms*. Queries are formulated as a combination of terms using the standard Boolean set-theoretic operators AND, OR, and NOT. Retrieval and relevance are treated as binary concepts in this model, so the result is an “exact match” set of documents. There is no notion of ranking: all retrieved documents are considered equally important. This is a major simplification, because it ignores the frequencies of document terms and their proximity to other terms when compared against the query terms.

Boolean retrieval models lack sophisticated ranking algorithms and are
among the earliest and simplest information retrieval models. These models make
it easy to associate metadata information and write queries that match the
contents of the documents as well as other properties of documents, such as
date of creation, author, and type of document.
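
As a minimal sketch of exact-match Boolean retrieval, the operators AND, OR, and NOT can be mapped onto set intersection, union, and complement over an inverted index. The toy documents and the helper names `build_index` and `term` are assumptions made for illustration; they are not part of the original text.

```python
from collections import defaultdict

# Toy collection; the document IDs and texts are made up for illustration.
DOCS = {
    1: "database systems store structured data",
    2: "information retrieval systems rank documents",
    3: "boolean retrieval returns unranked documents",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

INDEX = build_index(DOCS)
ALL_IDS = set(DOCS)

def term(t):
    """Documents containing term t (empty set if the term is unknown)."""
    return INDEX.get(t, set())

# AND, OR, and NOT become intersection, union, and complement.
and_result = term("retrieval") & term("documents")       # retrieval AND documents
or_result = term("database") | term("boolean")           # database OR boolean
not_result = term("systems") & (ALL_IDS - term("rank"))  # systems AND NOT rank
```

Each result is an unordered set of document IDs, which reflects the point made above: a document either matches or it does not, and nothing ranks one match over another.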

**2. Vector Space Model**

The
vector space model provides a framework in which term weighting, ranking of
retrieved documents, and relevance feedback are possible. Documents are
represented as *features* and *weights* of term features in an *n*-dimensional vector space of terms. **Features** are a subset of the terms in a
*set of documents* that are deemed most
relevant to an IR search for this particular set of documents. The process of
selecting these important terms (features) and their properties as a sparse
(limited) list out of the very large number of available terms (the vocabulary
can contain hundreds of thousands of terms) is independent of the model
specification. The query is also specified as a terms vector (vector of
features), and this is compared to the document vectors for
similarity/relevance assessment.

The
similarity assessment function that compares two vectors is not inherent to the
model—different similarity functions can be used. However, the cosine of the
angle between the query and document vector is a commonly used function for
similarity assessment. As the angle between the vectors decreases, the cosine
of the angle approaches one, meaning that the similarity of the query with a
document vector increases. Terms (features) are weighted proportional to their
frequency counts to reflect the importance of terms in the calculation of
relevance measure. This is different from the Boolean model, which does not
take into account the frequency of words in the document for relevance match.

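In the vector model, the *document term weight* $w_{ij}$ (for term $i$ in document $j$) is based on some variation of the TF (term frequency) or TF-IDF (term frequency-inverse document frequency) scheme. The cosine similarity between a document vector and the query vector is typically computed as:

$$
\text{cosine}(d_j, q) = \frac{\langle d_j \times q \rangle}{\|d_j\| \times \|q\|} = \frac{\sum_{i=1}^{|V|} w_{ij} \times w_{iq}}{\sqrt{\sum_{i=1}^{|V|} w_{ij}^2} \times \sqrt{\sum_{i=1}^{|V|} w_{iq}^2}}
$$

In the formula given above, we use the following symbols:

$d_j$ is the document vector.

$q$ is the query vector.

$w_{ij}$ is the weight of term $i$ in document $j$.

$w_{iq}$ is the weight of term $i$ in query vector $q$.

$|V|$ is the number of dimensions in the vector, that is, the total number of important keywords (or features).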
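
The cosine measure described above can be sketched in a few lines. The function name `cosine`, the example vectors, and the choice to return 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_d * norm_q)

# Illustrative weight vectors over a three-term vocabulary.
d1 = [1.0, 2.0, 0.0]   # points in the same direction as q
d2 = [0.0, 0.0, 3.0]   # shares no terms with q
q = [2.0, 4.0, 0.0]

sim_d1 = cosine(d1, q)  # close to 1: the angle between the vectors is zero
sim_d2 = cosine(d2, q)  # 0: the vectors are orthogonal
```

Because the measure depends only on the angle, `d1` scores as highly as a vector can even though its weights are half those of `q`.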

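TF-IDF uses the product of the normalized frequency of a term $i$ ($TF_{ij}$) in document $D_j$ and the inverse document frequency of the term ($IDF_i$) to weight a term in a document. A term that captures the essence of a document occurs frequently in that document (its TF is high), but to discriminate the document from others it must occur in only a few documents of the collection (its IDF is high as well).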
IDF values can be easily computed for a fixed collection of documents. In the case of Web search engines, IDF is approximated by computing it over a representative sample of documents. The following formulas can be used:

$$
TF_{ij} = \frac{f_{ij}}{\sum_{k=1}^{|V|} f_{kj}}
$$

$$
IDF_i = \log\left(\frac{N}{n_i}\right)
$$

In these formulas, the meaning of the symbols is:

$TF_{ij}$ is the normalized term frequency of term $i$ in document $D_j$.

$f_{ij}$ is the number of occurrences of term $i$ in document $D_j$.

$IDF_i$ is the inverse document frequency weight for term $i$.

$N$ is the number of documents in the collection.

$n_i$ is the number of documents in which term $i$ occurs.

Note that if a term $i$ occurs in all documents, then $n_i = N$ and hence $IDF_i = \log(1) = 0$, which nullifies the importance of that term.

Sometimes, the relevance of the document with respect to a query ($\text{rel}(D_j, Q)$) is directly measured as the sum of the TF-IDF values of the terms in the query $Q$:

$$
\text{rel}(D_j, Q) = \sum_{i \in Q} TF_{ij} \times IDF_i
$$

The normalization factor (similar to the denominator of the cosine formula) is incorporated into the TF-IDF formula itself, so the relevance of a document to the query is measured by the dot product of the query and document vectors.
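
The TF-IDF weighting and the sum-based relevance score can be sketched as follows. This is only an illustration under stated assumptions: TF is taken as a term's count divided by the total number of term occurrences in the document, IDF as the natural logarithm of N over the document frequency, and the toy collection and the names `tf`, `idf`, and `rel` are invented for the example.

```python
import math
from collections import Counter

# Toy collection, made up for illustration.
DOCS = {
    "D1": "query ranking ranking model",
    "D2": "query model",
    "D3": "query database index",
}

def tf(doc_text):
    """Normalized term frequency: count / total occurrences in the document."""
    counts = Counter(doc_text.split())
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def idf(docs):
    """IDF_i = log(N / n_i), where n_i is the number of documents containing i."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(text.split()))
    return {t: math.log(n_docs / n) for t, n in doc_freq.items()}

IDF = idf(DOCS)
TF = {doc_id: tf(text) for doc_id, text in DOCS.items()}

def rel(doc_id, query_terms):
    """Sum of TF-IDF values of the query terms in the given document."""
    return sum(TF[doc_id].get(t, 0.0) * IDF.get(t, 0.0) for t in query_terms)

score_d1 = rel("D1", ["query", "ranking"])
score_d2 = rel("D2", ["query", "ranking"])
```

In this collection the term `query` occurs in every document, so its IDF is zero and it contributes nothing to any score; only `ranking` separates `D1` from `D2`.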

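The Rocchio algorithm is a well-known relevance feedback algorithm based on the vector space model that modifies the initial query vector and its weights in response to user-identified relevant documents. It expands the original query vector $q$ to a new vector $q_e$ as follows:

$$
q_e = \alpha q + \frac{\beta}{|D_r|} \sum_{d_r \in D_r} d_r - \frac{\gamma}{|D_{ir}|} \sum_{d_{ir} \in D_{ir}} d_{ir}
$$

Here, $D_r$ and $D_{ir}$ are the sets of relevant and nonrelevant documents, and $\alpha$, $\beta$, and $\gamma$ are parameters of the equation. The values of these parameters determine how the feedback affects the original query; they are usually set by trial-and-error experiments.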
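
A minimal sketch of Rocchio-style query expansion follows, assuming the common form in which the new query is the weighted original query plus the centroid of the relevant documents minus the centroid of the nonrelevant ones. The function name `rocchio`, the default parameter values, and the example vectors are assumptions chosen for illustration, not values given in the text.

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return the expanded query vector as a list of floats.

    alpha, beta, gamma are tuning weights; the defaults are illustrative.
    """
    def centroid(vectors):
        # Mean vector of a document set; zero vector if the set is empty.
        if not vectors:
            return [0.0] * len(q)
        return [sum(col) / len(vectors) for col in zip(*vectors)]

    c_rel = centroid(relevant)
    c_non = centroid(nonrelevant)
    return [alpha * qi + beta * r - gamma * n
            for qi, r, n in zip(q, c_rel, c_non)]

# Three-term vocabulary; the user marked two documents relevant, one not.
q = [1.0, 0.0, 0.0]
relevant = [[1.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
nonrelevant = [[0.0, 0.0, 2.0]]

q_e = rocchio(q, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25)
```

The expanded query gains weight on the second term, which appears only in relevant documents, and a negative weight on the third term, which appears only in the nonrelevant one. Some implementations clip negative weights to zero; this sketch does not.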

**3. Probabilistic Model**

The similarity measures in the vector space model are somewhat ad hoc.
For example, the model assumes that those documents closer to the query in
cosine space are more relevant to the query vector. In the probabilistic model,
a more concrete and definitive approach is taken: ranking documents by their
estimated probability of relevance with respect to the query and the document.
This is the basis of the *Probability Ranking Principle* developed by Robertson.

In the probabilistic framework, the IR system has to decide whether the
documents belong to the **relevant set**
or the **nonrelevant** set for a query.
To make this decision, it is assumed that a predefined relevant set and
nonrelevant set exist for the query, and the task is to calculate the probability
that the document belongs to the relevant set and compare that with the
probability that the document belongs to the nonrelevant set.

Given the document representation $D$ of a document, estimating the relevance $R$ and nonrelevance $NR$ of that document involves computing the conditional probabilities $P(R|D)$ and $P(NR|D)$. These conditional probabilities can be calculated using Bayes’ Rule:

$$
P(R|D) = \frac{P(D|R) \times P(R)}{P(D)}
$$

$$
P(NR|D) = \frac{P(D|NR) \times P(NR)}{P(D)}
$$

A document $D$ is classified as relevant if $P(R|D) > P(NR|D)$. Discarding the constant $P(D)$, this is equivalent to saying that a document is relevant if:

$$
P(D|R) \times P(R) > P(D|NR) \times P(NR)
$$

The likelihood ratio $P(D|R)/P(D|NR)$ is used as a score to determine the likelihood of the document with representation $D$ belonging to the relevant set.

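The *term independence* or *Naïve Bayes* assumption is used to estimate $P(D|R)$ by computing $P(t_i|R)$ for each term $t_i$ in the document. The likelihood ratios $P(D|R)/P(D|NR)$ of documents are then used as a proxy for ranking, on the assumption that highly ranked documents have a high likelihood of belonging to the relevant set.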
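
Under the term independence assumption, the likelihood ratio factors into a product of per-term ratios, which is easier to compute as a sum of logarithms. The sketch below illustrates this with hypothetical per-term probabilities; the tables `P_TERM_GIVEN_R` and `P_TERM_GIVEN_NR`, the function names, and the choice to skip terms missing from the tables are all assumptions made for the example.

```python
import math

# Hypothetical per-term probabilities; in practice these would be estimated
# from relevance judgments. The values are illustrative only.
P_TERM_GIVEN_R = {"retrieval": 0.4, "ranking": 0.2, "database": 0.1}
P_TERM_GIVEN_NR = {"retrieval": 0.1, "ranking": 0.1, "database": 0.2}

def log_likelihood_ratio(doc_terms):
    """log[P(D|R) / P(D|NR)] under the term independence assumption."""
    score = 0.0
    for t in doc_terms:
        # Terms without estimates are skipped (a simplification).
        if t in P_TERM_GIVEN_R and t in P_TERM_GIVEN_NR:
            score += math.log(P_TERM_GIVEN_R[t] / P_TERM_GIVEN_NR[t])
    return score

def is_relevant(doc_terms, p_r=0.5):
    """Relevant when P(D|R) * P(R) > P(D|NR) * P(NR), compared in log space."""
    prior_log_odds = math.log(p_r / (1.0 - p_r))
    return log_likelihood_ratio(doc_terms) + prior_log_odds > 0.0

doc_a = ["retrieval", "ranking"]   # terms more likely under R
doc_b = ["database"]               # term more likely under NR
```

With equal priors, `doc_a` is classified as relevant and `doc_b` is not. Lowering `p_r` raises the bar a document's likelihood ratio must clear.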

Under some reasonable assumptions and estimates, and with extensions for incorporating query term weights and document term weights, the probabilistic model yields a popular ranking algorithm called **BM25** (Best Match 25). This weighting scheme evolved from several versions of the **Okapi** system.

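The Okapi weight for document $d_j$ and query $q$ is computed by the formula below. Additional notations are as follows:

$t_i$ is a term.

$f_{ij}$ is the raw frequency count of term $t_i$ in document $d_j$.

$f_{iq}$ is the raw frequency count of term $t_i$ in query $q$.

$N$ is the total number of documents in the collection.

$df_i$ is the number of documents that contain the term $t_i$.

$dl_j$ is the document length (in bytes) of $d_j$.

$avdl$ is the average document length of the collection.

The Okapi relevance score of a document $d_j$ for a query $q$ is given by the equation below, where $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75), and $k_2$ (between 1 and 1000) are parameters:

$$
\text{okapi}(d_j, q) = \sum_{t_i \in q,\, d_j} \ln\frac{N - df_i + 0.5}{df_i + 0.5} \times \frac{(k_1 + 1)\, f_{ij}}{k_1\left(1 - b + b\,\frac{dl_j}{avdl}\right) + f_{ij}} \times \frac{(k_2 + 1)\, f_{iq}}{k_2 + f_{iq}}
$$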
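
A sketch of the commonly published form of the Okapi BM25 score follows. It is an illustration under stated assumptions rather than a reference implementation: document length is counted in tokens instead of bytes, the parameter defaults `k1=1.2`, `b=0.75`, and `k2=100.0` are illustrative, and the function name `okapi_score` and the toy collection are invented for the example.

```python
import math
from collections import Counter

def okapi_score(query_terms, doc_terms, doc_freq, n_docs, avdl,
                k1=1.2, b=0.75, k2=100.0):
    """BM25 score of one tokenized document for one tokenized query."""
    f_d = Counter(doc_terms)    # raw term counts in the document
    f_q = Counter(query_terms)  # raw term counts in the query
    dl = len(doc_terms)         # length in tokens (an assumption of this sketch)
    score = 0.0
    for t in f_q:
        if t not in f_d:
            continue  # only terms shared by query and document contribute
        df = doc_freq[t]
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        doc_part = ((k1 + 1) * f_d[t]) / (k1 * (1 - b + b * dl / avdl) + f_d[t])
        query_part = ((k2 + 1) * f_q[t]) / (k2 + f_q[t])
        score += idf * doc_part * query_part
    return score

# Toy tokenized collection, made up for illustration.
docs = [
    ["okapi", "ranking", "ranking"],
    ["boolean", "model"],
    ["vector", "model"],
    ["semantic", "model"],
]
n_docs = len(docs)
avdl = sum(len(d) for d in docs) / n_docs
doc_freq = Counter(t for d in docs for t in set(d))

scores = [okapi_score(["ranking"], d, doc_freq, n_docs, avdl) for d in docs]
```

Only the first document contains `ranking`, so it is the only one with a nonzero score. Note that the logarithmic factor becomes negative for a term that occurs in more than half of the documents (`model` here), which is a known property of this form of the formula.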

**4. Semantic Model**

However sophisticated the above statistical models become, they can miss
many relevant documents because those models do not capture the complete
meaning or information need conveyed by a user’s query. In semantic models, the
process of matching documents to a given query is based on concept level and
semantic matching instead of index term (keyword) matching. This allows
retrieval of relevant documents that share meaningful associations with other
documents in the query result, even when these associations are not inherently
observed or statistically captured.

Semantic approaches include different levels of analysis, such as morphological, syntactic, and semantic analysis, to retrieve documents more effectively. In **morphological analysis**, roots and affixes are analyzed to determine the parts of speech (nouns, verbs, adjectives, and so on) of the words. **Syntactic analysis** then parses and analyzes complete phrases in documents. Finally, the semantic methods have to resolve word ambiguities and/or generate relevant synonyms based on the **semantic relationships** between levels of structural entities in documents (words, paragraphs, pages, or entire documents).
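
One small piece of this idea, generating synonyms so that a query matches documents that use different words for the same concept, can be sketched with a toy synonym table. The hand-written `SYNONYMS` dictionary stands in for a real thesaurus, and the function names are hypothetical; a real semantic system would also need the morphological and syntactic steps described above.

```python
# Hypothetical, hand-written synonym table standing in for a real thesaurus.
SYNONYMS = {
    "car": {"automobile", "vehicle"},
    "buy": {"purchase"},
}

def expand_query(terms):
    """Return the query terms together with their known synonyms."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNONYMS.get(t, set())
    return expanded

def matches(doc_text, query_terms):
    """True if the document shares at least one term with the expanded query."""
    doc_terms = set(doc_text.lower().split())
    return bool(doc_terms & expand_query(query_terms))

hit = matches("Used automobile for sale", ["car"])   # matched via a synonym
miss = matches("Used bicycle for sale", ["car"])     # no shared concept
```

A plain keyword match on `car` would miss the first document; the expansion step is what retrieves it.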

The development of a sophisticated semantic system requires complex knowledge bases of semantic information as well as retrieval heuristics. These systems often require techniques from artificial intelligence and expert systems. Knowledge bases like Cyc and WordNet have been developed for use in *knowledge-based IR systems* based on semantic models. The Cyc knowledge base, for example, is a representation of a vast quantity of commonsense knowledge: over 2.5 million assertions (facts and rules) interrelating more than 155,000 concepts for reasoning about the objects and events of everyday life. WordNet is an extensive thesaurus (over 115,000 concepts) that is widely used and under continuous development (see Section 27.4.3).
