Some General Issues Concerning Indexing
1. Logical versus Physical Indexes
In the earlier discussion, we have assumed that the index entries <K, Pr>
(or <K, P>) always include a physical pointer Pr (or P) that specifies
the physical record address on disk
as a block number and offset. This is sometimes called a physical index, and it
has the disadvantage that the pointer must be changed if the record is moved to another disk location. For
example, suppose that a primary file organization is based on linear hashing
or extendible hashing; then, each time a bucket is split, some records are allocated
to new buckets and hence have new physical addresses. If there were a secondary
index on the file, the pointers to those records would have to be found and
updated, which is a difficult task.
To remedy
this situation, we can use a structure called a logical index, whose index entries are of the form <K, Kp>.
Each entry has one value K for the
secondary indexing field matched with the value Kp of the field used for the primary file organization.
By searching the secondary index on the value of K, a program can locate the corresponding value of Kp and use this to access the
record through the primary file organization. Logical indexes thus introduce an
additional level of indirection between the access structure and the data. They
are used when physical record addresses are expected to change frequently. The
cost of this indirection is the extra search based on the primary file
organization.
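As a minimal sketch of this indirection, the code below models a logical secondary index whose entries map a secondary key K to the primary organization's key Kp, with a hashed primary file standing in for the primary organization. The class and attribute names (Ssn, Name) are illustrative assumptions, not taken from the text.

```python
# Sketch of a logical secondary index: entries are <K, Kp>, not <K, physical address>.
# The hashed primary file and field names are assumptions chosen for the example.

class HashedPrimaryFile:
    """Primary file organization: records placed in buckets by hashing Kp (here, Ssn)."""
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, record):
        self.buckets[hash(record["Ssn"]) % len(self.buckets)].append(record)

    def find(self, kp):
        # The physical location of a record may change (e.g., after a bucket split),
        # but a lookup by Kp always succeeds through the primary organization.
        for record in self.buckets[hash(kp) % len(self.buckets)]:
            if record["Ssn"] == kp:
                return record
        return None

class LogicalIndex:
    """Logical secondary index on K; stores Kp instead of a physical pointer."""
    def __init__(self):
        self.entries = {}              # secondary key K -> primary key Kp

    def insert(self, k, kp):
        self.entries[k] = kp

    def lookup(self, k, primary_file):
        kp = self.entries.get(k)       # first search: the secondary index on K
        if kp is None:
            return None
        return primary_file.find(kp)   # extra search via the primary file organization

# Usage: index employees by Name (secondary) while the file is hashed on Ssn.
primary = HashedPrimaryFile()
name_index = LogicalIndex()
emp = {"Ssn": "123-45-6789", "Name": "Smith", "Salary": 50000}
primary.insert(emp)
name_index.insert(emp["Name"], emp["Ssn"])
print(name_index.lookup("Smith", primary))
```

The second lookup through the primary organization is exactly the extra cost of the indirection mentioned above; in exchange, bucket splits never invalidate the index entries.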
2. Discussion
In many
systems, an index is not an integral part of the data file but can be created
and discarded dynamically. That is why it is often called an access structure. Whenever we expect to
access a file frequently based on some search condition involving a particular
field, we can request the DBMS to create an index on that field. Usually, a
secondary index is created to avoid physical ordering of the records in the
data file on disk.
The main
advantage of secondary indexes is that—theoretically, at least—they can be
created in conjunction with virtually any
primary record organization. Hence, a secondary index could be used to
complement other primary access methods such as ordering or hashing, or it
could even be used with mixed files. To create a B+-tree secondary
index on some field of a file, we must go through all records in the file to
create the entries at the leaf level of the tree. These entries are then sorted
and filled according to the specified fill factor; simultaneously, the other
index levels are created. It is more expensive and much harder to create
primary indexes and clustering indexes dynamically, because the records of the
data file must be physically sorted on disk in order of the indexing field.
However, some systems allow users to create these indexes dynamically on their
files by sorting the file during index creation.
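The bulk-loading step for the leaf level can be pictured with the short sketch below, which collects one <key, pointer> entry per record, sorts the entries, and packs leaves to a given fill factor. The node capacity, fill factor, and record layout are assumptions for the example; building the upper index levels from the first key of each packed leaf is omitted.

```python
# Sketch of building the leaf level of a B+-tree secondary index by bulk loading.
# LEAF_CAPACITY, FILL_FACTOR, and the record format are illustrative assumptions.

LEAF_CAPACITY = 4      # maximum <key, pointer> entries per leaf node
FILL_FACTOR = 0.75     # pack each leaf to roughly 75% of capacity

def build_leaf_level(records, key_field):
    # Step 1: scan all records and form one index entry per record.
    entries = [(rec[key_field], rec_id) for rec_id, rec in enumerate(records)]
    # Step 2: sort the entries on the indexing field.
    entries.sort(key=lambda entry: entry[0])
    # Step 3: fill leaves left to right according to the fill factor.
    per_leaf = max(1, int(LEAF_CAPACITY * FILL_FACTOR))
    return [entries[i:i + per_leaf] for i in range(0, len(entries), per_leaf)]

records = [{"Name": n} for n in ["Wong", "Smith", "Zelaya", "Borg", "Narayan", "English"]]
for leaf in build_leaf_level(records, "Name"):
    print(leaf)
```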
It is
common to use an index to enforce a key
constraint on an attribute. While searching the index to insert a new
record, it is straightforward to check at the same time whether another record
in the file—and hence in the index tree—has the same key attribute value as the
new record. If so, the insertion can be rejected.
If an
index is created on a nonkey field, duplicates
occur; handling of these duplicates is an issue the DBMS product vendors have
to deal with and affects data storage as well as index creation and
management. Data records for the duplicate key may be contained in the same
block or, where many duplicates are possible, may span multiple blocks. Some
systems add a row id to the record so that records with duplicate keys have
their own unique identifiers. In such cases, the B+-tree index may
regard a <key, Row_id> combination as the de facto key for the index,
turning the index into a unique index with no duplicates. The deletion of a key
K from such an index would involve
deleting all occurrences of that key K—hence
the deletion algorithm has to account for this.
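A hedged sketch of this approach appears below: a sorted list of (key, row_id) pairs stands in for the B+-tree, every entry is unique because of the row id, and deleting a key K removes the whole range of entries for K. The integer row ids and sentinel values are assumptions chosen for the example.

```python
# Sketch: handling duplicates by indexing the (key, row_id) combination, which
# makes every index entry unique. A bisect-maintained sorted list stands in for
# the B+-tree; keys and row ids are integers so the sentinel comparisons work.
import bisect

class CompositeKeyIndex:
    def __init__(self):
        self.entries = []              # sorted list of (key, row_id) pairs

    def insert(self, key, row_id):
        bisect.insort(self.entries, (key, row_id))

    def delete_key(self, key):
        # Deleting key K must remove *all* occurrences of K, i.e., every
        # (K, row_id) entry in the range [(K, -inf), (K, +inf)).
        lo = bisect.bisect_left(self.entries, (key, float("-inf")))
        hi = bisect.bisect_right(self.entries, (key, float("inf")))
        del self.entries[lo:hi]

idx = CompositeKeyIndex()
for row_id, dept in enumerate([5, 4, 5, 5, 1]):
    idx.insert(dept, row_id)
idx.delete_key(5)            # removes the three entries with key 5
print(idx.entries)           # [(1, 4), (4, 1)]
```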
In actual
DBMS products, deletion from B+-tree indexes is also handled in
various ways to improve performance and response times. Deleted records may be
marked as deleted and the corresponding index entries may also not be removed
until a garbage collection process reclaims the space in the data file; the
index is rebuilt online after garbage collection.
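One way to picture this lazy strategy, as a sketch rather than any product's actual mechanism, is shown below: deletion only sets a flag, and a later garbage-collection pass compacts the data file and rebuilds the index.

```python
# Sketch: lazy deletion. Records are only marked as deleted; index entries stay
# in place until a garbage-collection pass reclaims space and rebuilds the index.

class HeapFileWithLazyDelete:
    def __init__(self):
        self.records = []      # data file
        self.index = {}        # key -> position in self.records

    def insert(self, key, payload):
        self.records.append({"key": key, "payload": payload, "deleted": False})
        self.index[key] = len(self.records) - 1

    def delete(self, key):
        pos = self.index.get(key)
        if pos is not None:
            self.records[pos]["deleted"] = True   # mark only; index untouched

    def garbage_collect(self):
        # Reclaim space in the data file, then rebuild the index.
        self.records = [r for r in self.records if not r["deleted"]]
        self.index = {r["key"]: i for i, r in enumerate(self.records)}

f = HeapFileWithLazyDelete()
f.insert("A", 1); f.insert("B", 2); f.insert("C", 3)
f.delete("B")
f.garbage_collect()
print(f.index)    # {'A': 0, 'C': 1}
```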
A file
that has a secondary index on every one of its fields is often called a fully inverted file. Because all indexes are secondary, new records are
inserted at the end of the file;
therefore, the data file itself is an unordered (heap) file. The indexes are
usually implemented as B+-trees, so they are updated dynamically to
reflect insertion or deletion of records. Some commercial DBMSs, such as
Software AG’s Adabas, use this method extensively.
We
referred to the popular IBM file organization called ISAM in Section 18.2.
Another IBM method, the virtual storage
access method (VSAM), is
somewhat similar to the B+-tree access structure and is still being
used in many commercial systems.
3. Column-Based Storage of Relations
There has been a recent trend to consider column-based storage of relations as an
alternative to the traditional way of storing relations row by row. Commercial
relational DBMSs have offered B+-tree indexing on primary as well as secondary keys
as an efficient mechanism to support access to data by various search criteria,
together with the ability to write a row or a set of rows to disk at a time,
producing write-optimized systems. For data warehouses (to be discussed in
Chapter 29), which are read-only databases, column-based storage offers particular
advantages for read-only queries. Typically, column-store RDBMSs store each column
of data individually and gain performance advantages in the following areas:
1. Vertically
partitioning the table column by column, so that a two-column table can be
constructed for every attribute and thus only the needed columns can be
accessed
2. Use of column-wise indexes (similar to the bitmap indexes discussed in Section 18.5.2) and join indexes on multiple tables to answer queries without having to access the data tables
3. Use of
materialized views (see Chapter 5) to support queries on multiple columns
Column-wise
storage of data affords additional freedom in the creation of indexes, such as
the bitmap indexes discussed earlier. The same column may be present in
multiple projections of a table and indexes may be created on each projection.
To store the values in the same column, strategies for data compression,
null-value suppression, dictionary encoding techniques (where distinct values
in the column are assigned shorter codes), and run-length encoding techniques
have been devised. MonetDB/X100, C-Store, and Vertica are examples of such
systems. Further discussion on column-store DBMSs can be found in the
references mentioned in this chapter’s Selected Bibliography.
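To make the compression ideas above concrete, the following sketch applies dictionary encoding and run-length encoding to a single column stored separately from the rest of its table. The column values and the integer codes are illustrative assumptions and are not drawn from any particular column-store system.

```python
# Sketch: two column-compression techniques mentioned above, applied to one
# column that is stored on its own.

def dictionary_encode(column):
    # Assign each distinct value a short integer code and encode the column.
    codes = {}
    encoded = []
    for value in column:
        code = codes.setdefault(value, len(codes))
        encoded.append(code)
    return codes, encoded

def run_length_encode(column):
    # Collapse consecutive repeated values into (value, run_length) pairs.
    runs = []
    for value in column:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

dept_column = ["Research", "Research", "Research", "Admin", "Admin", "HQ"]
print(dictionary_encode(dept_column))   # ({'Research': 0, 'Admin': 1, 'HQ': 2}, [0, 0, 0, 1, 1, 2])
print(run_length_encode(dept_column))   # [['Research', 3], ['Admin', 2], ['HQ', 1]]
```

Both techniques benefit from the fact that a single column tends to contain many repeated or similar values, which is exactly what column-wise storage exposes.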