Multimedia Database Concepts
Multimedia databases provide
features that allow users to store and query different types of multimedia information, which includes images (such as photos or drawings), video clips (such as movies, newsreels,
or home videos), audio clips (such as
songs, phone messages, or speeches), and documents
(such as books or articles). The main types of database queries that are needed
involve locating multimedia sources that contain certain objects of interest.
For example, one may want to locate all video clips in a video database that
include a certain person, say Michael Jackson. One may also want to retrieve
video clips based on certain activities included in them, such as video clips
where a soccer goal is scored by a certain player or team.
The above
types of queries are referred to as content-based
retrieval, because the multimedia source is being retrieved based on its
containing certain objects or activities. Hence, a multimedia database must use
some model to organize and index the multimedia sources based on their
contents. Identifying the contents of
multimedia sources is a difficult and time-consuming task. There are two main
approaches. The first is based on automatic
analysis of the multimedia sources to identify certain mathematical
characteristics of their contents. This approach uses different techniques
depending on the type of multimedia source (image, video, audio, or text). The
second approach depends on manual
identification of the objects and activities of interest in each multimedia
source and on using this information to index the sources. This approach can
be applied to all multimedia sources, but it requires a manual preprocessing
phase where a person has to scan each multimedia source to identify and catalog
the objects and activities it contains so that they can be used to index the
sources.
In the
first part of this section, we will briefly discuss some of the characteristics
of each type of multimedia source—images, video, audio, and text/documents.
Then we will discuss approaches for automatic analysis of images followed by
the problem of object recognition in images. We end this section with some
remarks on analyzing audio sources.
An image is typically stored either in raw
form as a set of pixel or cell values, or in compressed form to save space. The
image shape descriptor describes the
geometric shape of the raw image, which is typically a rectangle of cells of a certain width and height.
Hence, each image can be represented by an m
by n grid of cells. Each cell
contains a pixel value that describes the cell content. In black-and-white
images, pixels can be one bit. In gray scale or color images, a pixel is
multiple bits. Because images may require large amounts of space, they are
often stored in compressed form. Compression standards such as GIF and JPEG (and MPEG for video)
use various mathematical transformations to reduce the number of cells stored
but still maintain the main image characteristics. Applicable mathematical
transforms include Discrete Fourier Transform (DFT), Discrete Cosine Transform
(DCT), and wavelet transforms.
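As a rough illustration (a minimal sketch, not a real codec), the following Python fragment treats an image as an m by n grid of pixel values and mimics transform-based compression by keeping only a block of low-frequency DFT coefficients; the grid size and the number of retained coefficients are arbitrary choices.

import numpy as np

def compress_keep_low_freq(image, keep):
    """Crudely 'compress' an image by keeping a keep x keep block of low-frequency DFT coefficients."""
    coeffs = np.fft.fft2(image)                      # forward transform
    compressed = np.zeros_like(coeffs)
    compressed[:keep, :keep] = coeffs[:keep, :keep]  # retain low frequencies only
    return compressed

def reconstruct(compressed):
    """Invert the transform to obtain an approximation of the original image."""
    return np.real(np.fft.ifft2(compressed))

# Example: an 8 x 8 grayscale image stored as one value (0-255) per cell.
image = np.random.randint(0, 256, size=(8, 8)).astype(float)
approx = reconstruct(compress_keep_low_freq(image, keep=4))
print(np.abs(image - approx).mean())                 # average reconstruction error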
To
identify objects of interest in an image, the image is typically divided into
homogeneous segments using a homogeneity
predicate. For example, in a color image, adjacent cells that have similar
pixel values are grouped into a segment. The homogeneity predicate defines
conditions for automatically grouping those cells. Segmentation and compression
can hence identify the main characteristics of an image.
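A minimal sketch of such segmentation in Python is given below: a simple region-growing pass in which the (assumed) homogeneity predicate is that adjacent cells differ by at most a tolerance tol; real predicates are usually richer.

import numpy as np
from collections import deque

def segment(image, tol=10.0):
    """Group adjacent cells with similar pixel values into numbered segments."""
    labels = np.full(image.shape, -1, dtype=int)
    next_label = 0
    for start in np.ndindex(*image.shape):
        if labels[start] != -1:
            continue                                  # cell already assigned to a segment
        labels[start] = next_label
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                if (0 <= nr < image.shape[0] and 0 <= nc < image.shape[1]
                        and labels[nr, nc] == -1
                        and abs(float(image[nr, nc]) - float(image[r, c])) <= tol):
                    labels[nr, nc] = next_label       # homogeneity predicate holds
                    queue.append((nr, nc))
        next_label += 1
    return labels

print(segment(np.array([[10, 12, 200], [11, 13, 205], [90, 91, 92]])))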
A typical
image database query would be to find images in the database that are similar
to a given image. The given image could be an isolated segment that contains,
say, a pattern of interest, and the query is to locate other images that
contain that same pattern. There are two main techniques for this type of
search. The first approach uses a distance
function to compare the given image with the stored images and their
segments. If the distance value returned is small, the probability of a match
is high. Indexes can be created to group stored images that are close in the
distance metric so as to limit the search space. The second approach, called
the transformation approach,
measures image similarity by the number of transformations needed to change one image’s cells so that they match the other image; the fewer transformations required, the more similar the two images. Transformations include rotations, translations, and scaling.
Although the transformation approach is more general, it is also more
time-consuming and difficult.
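A minimal sketch of the distance-function approach is shown below; it assumes each stored image has already been summarized by a (hypothetical) feature vector, and the distance threshold is arbitrary.

import numpy as np

def euclidean(a, b):
    return float(np.linalg.norm(a - b))

def find_similar(query_vec, stored, threshold=0.25):
    """stored: dict mapping image id -> precomputed feature vector."""
    return [img_id for img_id, vec in stored.items()
            if euclidean(query_vec, vec) <= threshold]

stored = {"img1": np.array([0.2, 0.5, 0.3]), "img2": np.array([0.7, 0.1, 0.2])}
print(find_similar(np.array([0.25, 0.45, 0.30]), stored))   # -> ['img1']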
A video source is typically represented
as a sequence of frames, where each frame is a still image. However, rather
than identifying the objects and activities in every individual frame, the
video is divided into video segments,
where each segment comprises a sequence of contiguous frames that includes the
same objects/activities. Each segment is identified by its starting and ending
frames. The objects and activities identified in each video segment can be
used to index the segments. An indexing technique called frame segment trees has been proposed for video indexing. The index
includes both objects, such as persons, houses, and cars, as well as
activities, such as a person delivering
a speech or two people talking.
Videos are also often compressed using standards such as MPEG.
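The following sketch illustrates the idea of indexing video segments by the objects and activities they contain. It is a simplified inverted index rather than an actual frame segment tree, and the segment contents are hypothetical.

from dataclasses import dataclass, field

@dataclass
class VideoSegment:
    start_frame: int
    end_frame: int
    objects: set = field(default_factory=set)      # e.g. {"person", "car"}
    activities: set = field(default_factory=set)   # e.g. {"delivering a speech"}

def build_index(segments):
    """Map each object or activity to the list of segments that contain it."""
    index = {}
    for i, seg in enumerate(segments):
        for term in seg.objects | seg.activities:
            index.setdefault(term, []).append(i)
    return index

segments = [VideoSegment(0, 120, {"person"}, {"delivering a speech"}),
            VideoSegment(121, 300, {"person"}, {"two people talking"})]
print(build_index(segments)["delivering a speech"])   # -> [0]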
Audio sources include stored recorded messages,
such as speeches, class presentations, or even surveillance recordings of
phone messages or conversations by law enforcement. Here, discrete transforms
can be used to identify the main characteristics of a certain person’s voice
in order to have similarity-based indexing and retrieval. We will briefly comment
on their analysis in Section 26.4.4.
A text/document source is basically the
full text of some article, book, or magazine. These sources are typically
indexed by identifying the keywords that appear in the text and their relative
frequencies. However, filler words or common words called stopwords are eliminated from the process. Because there can be
many keywords when attempting to index a collection of documents, techniques
have been developed to reduce the number of keywords to those that are most
relevant to the collection. A dimensionality reduction technique called singular value decomposition (SVD),
which is based on matrix transformations, can be used for this purpose. An
indexing technique called telescoping
vector trees (TV-trees) can then be used to group similar documents.
Chapter 27 discusses document processing in detail.
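The sketch below illustrates this pipeline on two toy documents: stopwords are removed, keyword frequencies are counted, and SVD reduces the term-by-document matrix to a small number of latent dimensions. The stopword list and the documents are made up for illustration.

import numpy as np
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}

def keyword_counts(text):
    """Count keywords after removing stopwords."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

docs = ["the goal of the database is storage and retrieval",
        "multimedia retrieval is retrieval of images and video"]
counts = [keyword_counts(d) for d in docs]
vocab = sorted(set().union(*counts))
term_doc = np.array([[c[w] for c in counts] for w in vocab], dtype=float)

# Keep only the k largest singular values/vectors (dimensionality reduction).
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one low-dimensional vector per document
print(doc_vectors)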
1. Automatic Analysis of Images
Analysis of multimedia sources is critical to support any type of query
or search interface. We need to represent multimedia source data such as images
in terms of features that would enable us to define similarity. The work done
so far in this area uses low-level visual features such as color, texture, and
shape, which are directly related to the perceptual aspects of image content. These
features are easy to extract and represent, and it is convenient to design
similarity measures based on their statistical properties.
Color is one of the most widely used visual features in content-based image retrieval since it does not depend
upon image size or orientation. Retrieval based on color similarity is mainly
done by computing a color histogram for each image that identifies the
proportion of pixels within an image for the three color channels (red, green,
blue—RGB). However, RGB representation
is affected by the orientation of the object with respect to illumination and
camera direction. Therefore, current image retrieval techniques compute color
histograms using competing invariant representations such as HSV (hue, saturation, value). HSV
describes colors as points in a cylinder whose central axis ranges from black
at the bottom to white at the top with neutral colors between them. The angle
around the axis corresponds to the hue, the distance from the axis corresponds
to the saturation, and the distance along the axis corresponds to the value
(brightness).
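As an illustration, the sketch below converts RGB pixels to HSV with Python's standard library and counts them into a fixed number of hue bins; practical systems typically bin all three HSV channels and use many more bins than shown here.

import colorsys
import numpy as np

def hue_histogram(rgb_pixels, bins=8):
    """Proportion of pixels falling into each hue bin."""
    hist = np.zeros(bins)
    for r, g, b in rgb_pixels:                                   # values in 0..255
        h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hist[min(int(h * bins), bins - 1)] += 1
    return hist / max(hist.sum(), 1)

print(hue_histogram([(255, 0, 0), (250, 10, 10), (0, 0, 255)]))  # mostly red, some blue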
Texture refers to patterns in an image that exhibit visual homogeneity but do not result from a single color or intensity value. Examples of texture classes are rough and silky. Examples of
textures that can be identified include pressed calf leather, straw matting,
cotton canvas, and so on. Just as pictures are represented by arrays of pixels
(picture elements), textures are represented by arrays of texels (texture elements). These textures are then placed
into a number of sets, depending on how many textures are identified in the
image. These sets not only contain the texture definition but also indicate
where in the image the texture is located. Texture identification is primarily
done by modeling it as a two-dimensional, gray-level variation. The relative
brightness of pairs of pixels is computed to estimate the degree of contrast,
regularity, coarseness, and directionality.
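The sketch below computes one such statistic, a simple contrast estimate based on the brightness difference between horizontally adjacent pixel pairs; real texture analysis combines several measures (regularity, coarseness, directionality) over texel regions.

import numpy as np

def contrast(gray):
    """Mean squared brightness difference of horizontally adjacent pixel pairs."""
    left, right = gray[:, :-1].astype(float), gray[:, 1:].astype(float)
    return float(np.mean((left - right) ** 2))

smooth = np.full((8, 8), 128.0)
rough = np.random.randint(0, 256, size=(8, 8)).astype(float)
print(contrast(smooth), contrast(rough))   # ~0 for the smooth patch, large for the rough one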
Shape refers to the shape of a region within an image. It is generally
determined by applying segmentation
or edge detection to an image. Segmentation
is a region-based approach that uses an entire region (sets of pixels), whereas
edge detection is a boundary-based
approach that uses only the outer boundary characteristics of entities. Shape
representation is typically required to be invariant to translation, rotation,
and scaling. Some well-known methods for shape representation include Fourier
descriptors and moment invariants.
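As a small example of a moment-based descriptor, the sketch below computes the first Hu moment of a binary region mask; this quantity is invariant to translation, scaling, and rotation of the region.

import numpy as np

def first_hu_moment(mask):
    """First Hu moment (eta20 + eta02) of a binary region mask."""
    ys, xs = np.nonzero(mask)                  # coordinates of region pixels
    m00 = len(xs)                              # region area
    xbar, ybar = xs.mean(), ys.mean()          # centroid (gives translation invariance)
    mu20 = np.sum((xs - xbar) ** 2)
    mu02 = np.sum((ys - ybar) ** 2)
    eta20 = mu20 / m00 ** 2                    # normalization (gives scale invariance)
    eta02 = mu02 / m00 ** 2
    return float(eta20 + eta02)                # rotation-invariant combination

mask = np.zeros((10, 10), dtype=bool)
mask[2:7, 3:8] = True                          # a 5 x 5 square region
print(first_hu_moment(mask))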
2. Object Recognition in Images
Object recognition is the
task of identifying real-world objects in an image or a video sequence. The system must be able to identify an object even when its images vary in viewpoint, size, or scale, or when the object is rotated or translated. Some approaches have been developed to
divide the original image into regions based on similarity of contiguous
pixels. Thus, in a given image showing a tiger in the jungle, a tiger subimage
may be detected against the background of the jungle, and when compared with a
set of training images, it may be tagged as a tiger.
The
representation of the multimedia object in an object model is extremely
important. One approach is to divide the image into homogeneous segments using
a homogeneity predicate. For example, in a color image, adjacent cells that
have similar pixel values are grouped into a segment. The homogeneity predicate
defines conditions for automatically grouping those cells. Segmentation and
compression can hence identify the main characteristics of an image. Another
approach finds measurements of the object that are invariant to
transformations. It is impossible to keep a database of examples of all the
different transformations of an image. To deal with this, object recognition
approaches find interesting points (or features) in an image that are invariant
to transformations.
An
important contribution to this field was made by Lowe, who used
scale-invariant features from images to perform reliable object recognition.
This approach is called scale-invariant
feature transform (SIFT). The SIFT features are invariant to image scaling
and rotation, and partially invariant to change in illumination and 3D camera
viewpoint. They are well localized in both the spatial and frequency domains,
reducing the probability of disruption by occlusion, clutter, or noise. In
addition, the features are highly distinctive, which allows a single feature to
be correctly matched with high probability against a large database of features,
providing a basis for object and scene recognition.
For image
matching and recognition, SIFT features (also known as keypoint features) are
first extracted from a set of reference images and stored in a database. Object recognition is then performed by
comparing each feature from the new image with the features stored in the
database and finding candidate matching features based on the Euclidean
distance of their feature vectors. Since the keypoint features are highly
distinctive, a single feature can be correctly matched with good probability in
a large database of features.
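The sketch below shows this matching step, assuming the descriptor vectors (e.g., 128-dimensional SIFT descriptors) have already been extracted. Each query feature is matched to its nearest stored feature by Euclidean distance and is kept only if it passes the distance-ratio test proposed by Lowe; the ratio value used here is illustrative.

import numpy as np

def match_descriptors(query, stored, ratio=0.8):
    """Match each query descriptor to its nearest stored descriptor (ratio test)."""
    matches = []
    for i, q in enumerate(query):
        dists = np.linalg.norm(stored - q, axis=1)   # distance to every stored feature
        nearest = np.argsort(dists)[:2]              # indices of the two closest features
        if dists[nearest[0]] < ratio * dists[nearest[1]]:
            matches.append((i, int(nearest[0])))     # (query index, stored index)
    return matches

query = np.random.rand(5, 128)
stored = np.vstack([query + 0.01 * np.random.rand(5, 128), np.random.rand(20, 128)])
print(match_descriptors(query, stored))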
In
addition to SIFT, there are a number of competing methods available for object
recognition under clutter or partial occlusion. For example, RIFT, a rotation-invariant generalization
of SIFT, identifies groups of local affine regions (image features having a
characteristic appearance and elliptical shape) that remain approximately
affinely rigid across a range of views of an object, and across multiple
instances of the same object class.
3. Semantic Tagging of Images
The notion of implicit tagging is an important one for image recognition
and comparison. Multiple tags may attach to an image or a subimage: for
instance, in the example we referred to above, tags such as “tiger,” “jungle,”
“green,” and “stripes” may be associated with that image. Most image search
techniques retrieve images based on user-supplied tags that are often not very
accurate or comprehensive. To improve search quality, a number of recent systems
aim at automated generation of these image tags. In case of multimedia data,
most of its semantics is present in its content. These systems use
image-processing and statistical-modeling techniques to analyze image content
to generate accurate annotation tags that can then be used to retrieve images
by content. However, because different annotation schemes may use different vocabularies to annotate images, the quality of image retrieval can suffer. To solve this
problem, recent research techniques have proposed the use of concept
hierarchies, taxonomies, or ontologies using OWL (Web Ontology Language), in which terms and their relationships
are clearly defined. These can be used to infer higher-level concepts based on
tags. Concepts like “sky” and “grass” may be further divided into “clear sky”
and “cloudy sky” or “dry grass” and “green grass” in such a taxonomy. These
approaches generally come under semantic tagging and can be used in conjunction
with the above feature-analysis and object-identification strategies.
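The sketch below illustrates the idea with a tiny, hypothetical taxonomy: each specific tag is expanded to its broader concepts, so an image tagged "green grass" can also be retrieved by a query for "grass".

TAXONOMY = {                    # child concept -> parent concept
    "clear sky": "sky",
    "cloudy sky": "sky",
    "dry grass": "grass",
    "green grass": "grass",
    "tiger": "animal",
}

def expand_tags(tags):
    """Add every ancestor concept of each tag."""
    expanded = set(tags)
    for tag in tags:
        parent = TAXONOMY.get(tag)
        while parent is not None:
            expanded.add(parent)
            parent = TAXONOMY.get(parent)
    return expanded

print(expand_tags({"tiger", "green grass"}))   # includes "animal" and "grass"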
4. Analysis of Audio Data Sources
Audio sources are broadly classified into speech, music, and other audio
data. Each of these is significantly different from the others, so each type of audio data is treated differently. Audio data must be digitized before it can be processed and stored. Indexing and retrieval of audio data is arguably the toughest among all types of media because, like video, it is continuous in time and, unlike text, does not have easily measurable characteristics. Clarity of sound recordings is easy for humans to perceive but hard to quantify for machine processing. Interestingly, speech data is often converted to text using speech recognition techniques, since indexing the resulting transcript is much easier and more accurate than indexing the raw audio. This is sometimes referred to as text-based indexing of audio data.
The speech metadata is typically content dependent, in that the metadata is generated from the audio content, for example, the length of the speech, the number of speakers, and so on. However, some of the metadata might be independent of the actual content, such as the format in which the data is stored. Music indexing, on the other hand, is done
based on the statistical analysis of the audio signal, also known as content-based indexing. Content-based
indexing often makes use of the key features
of sound: intensity, pitch, timbre, and rhythm. It is possible to compare
different pieces of audio data and retrieve information from them based on the
calculation of certain features, as well as application of certain transforms.
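As an illustration, the sketch below extracts two such features from a digitized signal: intensity as the root-mean-square amplitude of each frame, and a rough pitch estimate taken from the strongest spectral peak. The frame size and the example tone are arbitrary choices; real systems add timbre and rhythm features and use more robust pitch detection.

import numpy as np

def frame_features(signal, rate, frame_size=1024):
    """Per-frame (intensity, pitch estimate in Hz) pairs for a digitized signal."""
    features = []
    for start in range(0, len(signal) - frame_size + 1, frame_size):
        frame = signal[start:start + frame_size]
        intensity = float(np.sqrt(np.mean(frame ** 2)))            # RMS loudness
        spectrum = np.abs(np.fft.rfft(frame))
        pitch_hz = float(np.argmax(spectrum[1:]) + 1) * rate / frame_size
        features.append((intensity, pitch_hz))
    return features

# Example: a 440 Hz tone sampled at 8 kHz.
rate = 8000
t = np.arange(rate) / rate
print(frame_features(np.sin(2 * np.pi * 440 * t), rate)[0])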