Algorithms for SELECT and JOIN Operations
Implementing the SELECT Operation
There are many algorithms for executing a SELECT operation, which is basically a search operation to locate the
records in a disk file that satisfy a certain condition. Some of the search
algorithms depend on the file having specific access paths, and they may apply
only to certain types of selection conditions. We discuss some of the
algorithms for implementing SELECT in this section. We will use the following
operations, specified on the relational database in Figure 3.5, to illustrate
our dis-cussion:
OP1: σSsn
= ‘123456789’ (EMPLOYEE)
OP2: σDnumber
> 5 (DEPARTMENT)
OP3: σDno
= 5 (EMPLOYEE)
OP4: σDno = 5 AND
Salary > 30000 AND Sex
= ‘F’ (EMPLOYEE)
OP5: σEssn=‘123456789’ AND Pno
=10(WORKS_ON)
Search Methods for Simple Selection. A number of search algorithms are pos-sible for
selecting records from a file. These are also known as file scans, because they scan the records of a file to search for
and retrieve records that satisfy a selec-tion condition.4 If the
search algorithm involves the use of an index, the index search is called an index scan. The following search
methods (S1 through S6) are examples of some of the search algorithms that can
be used to implement a select operation:
S1—Linear search (brute force algorithm). Retrieve
every record in the file, and test
whether its attribute values satisfy the selection condition. Since the records
are grouped into disk blocks, each disk block is read into a main memory
buffer, and then a search through the records within the disk block is
conducted in main memory.
S2—Binary search. If the selection condition involves an equality
compari-son on a key attribute on which the file is ordered, binary search—which is
more efficient than linear search—can be used.
An example is OP1 if Ssn is the ordering attribute for the EMPLOYEE file.5
S3a—Using a primary index. If the selection condition involves an equality comparison on a key attribute with a primary index—for example, Ssn = ‘123456789’ in OP1—use the primary index to retrieve the record.
Note that this condition retrieves a single record (at most).
S3b—Using a hash key. If the selection condition involves an equality
com-parison on a key attribute with
a hash key—for example, Ssn = ‘123456789’ in OP1—use the hash key to retrieve the record. Note that this condition
retrieves a single record (at most).
S4—Using a primary index to retrieve multiple
records. If the comparison condition is >, >=, <, or
<= on a key field with a primary index—for exam-ple, Dnumber > 5 in OP2—use the index to find the record satisfying
the cor-responding equality condition (Dnumber = 5), then retrieve all subsequent records in
the (ordered) file. For the condition Dnumber < 5, retrieve all the preceding records.
S5—Using a clustering index to retrieve
multiple records. If the
selection condition involves an
equality comparison on a nonkey
attribute with a clustering index—for example, Dno = 5 in OP3—use the index to retrieve all the records
satisfying the condition.
S6—Using a secondary (B+-tree) index on an equality comparison. This search
method can be used to retrieve a single record if the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is
not a key. This can also be used for comparisons involving >, >=,
<, or <=.
In Section 19.8, we discuss how to develop
formulas that estimate the access cost of these search methods in terms of the
number of block accesses and access time. Method S1 (linear search) applies to any file, but all the other methods
depend on having the appropriate access path on the attribute used in the
selection condition. Method S2 (binary search)
requires the file to be sorted on the search attribute. The methods that use an
index (S3a, S4, S5, and S6) are generally referred to as index searches, and they
require the appropriate index to exist on the search attribute. Methods S4 and S6 can be used to
retrieve records in a certain range—for
example, 30000 <= Salary <= 35000. Queries involving such conditions
are called range queries.
Search Methods for Complex Selection. If a condition of a SELECT operation is a conjunctive
condition—that is, if it is made up of several simple conditions connected
with the AND logical connective such as OP4 above—the DBMS can use the following
additional methods to implement the operation:
S7—Conjunctive selection using an individual
index. If an attribute involved in any single simple condition in the conjunctive select condition has an
access path that permits the use of one of the methods S2 to S6, use that
condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions
in the conjunctive select condition.
S8—Conjunctive selection using a composite
index. If two or more attrib-utes
are involved in equality conditions in the conjunctive select condition and a
composite index (or hash structure) exists on the combined fields— for example,
if an index has been created on the composite key (Essn, Pno) of the WORKS_ON file for OP5—we can use the index directly.
S9—Conjunctive selection by intersection of
record pointers.6 If second-ary indexes (or other access paths)
are available on more than one of the fields involved in simple conditions in
the conjunctive select condition, and if the indexes include record pointers
(rather than block pointers), then each index can be used to retrieve the set of record pointers that satisfy the
indi-vidual condition. The intersection
of these sets of record pointers gives the record pointers that satisfy the
conjunctive select condition, which are then used to retrieve those records
directly. If only some of the conditions have
secondary indexes, each retrieved record is
further tested to determine whether it satisfies the remaining conditions.7
In general, method S9 assumes that each of the indexes is on a nonkey field of the file, because if one
of the conditions is an equality condition on a key field, only one record will
satisfy the whole condition.
Whenever a single condition specifies the
selection—such as OP1, OP2, or OP3— the DBMS can only check whether or not an
access path exists on the attribute involved in that condition. If an access
path (such as index or hash key or sorted file) exists, the method
corresponding to that access path is used; otherwise, the brute force, linear
search approach of method S1 can be used. Query optimization for a SELECT operation is needed mostly for conjunctive select conditions
whenever more than one of the attributes involved in the conditions
have an access path. The optimizer
should choose the access path that retrieves
the fewest records in the most efficient way by estimating the different
costs (see Section 19.8) and choosing the method with the least estimated cost.
Selectivity of a Condition. When the optimizer is choosing between multiple simple conditions in a conjunctive select condition, it typically
considers the selectivity of each
condition. The selectivity (sl) is defined as the ratio of the num-ber
of records (tuples) that satisfy the condition to the total number of records
(tuples) in the file (relation), and thus is a number between zero and one. Zero selec-tivity means none of the
records in the file satisfies the selection condition, and a selectivity of one means that all the
records in the file satisfy the condition. In gen-eral, the selectivity will
not be either of these two extremes, but will be a fraction that estimates the
percentage of file records that will be retrieved.
Although exact selectivities of all conditions
may not be available, estimates of selectivities are often kept in the
DBMS catalog and are used by the optimizer. For example, for an equality condition on a key attribute of relation r(R),
s = 1/|r(R)|, where |r(R)|
is the number of tuples in relation r(R). For an equality condition on a
nonkey attribute with i distinct values,
s can be estimated by (|r(R)|/i)/|r(R)|
or 1/i, assuming that the records are
evenly or uniformly distributed
among the distinct values. Under this assumption, |r(R)|/i records will satisfy an equality
condition on this attribute. In general, the number of records satisfying a
selection condition with selectivity sl
is estimated to be |r(R)| * sl. The smaller this estimate is, the higher the desirability of
using that condition first to retrieve records. In certain cases, the actual
distribution of records among the various distinct values of the attribute is
kept by the DBMS in the form of a histogram,
in order to get more accurate esti-mates of the number of records that satisfy
a particular condition.
Disjunctive Selection Conditions. Compared to a conjunctive selection condition,
a disjunctive condition (where
simple conditions are connected by the OR logical connective rather than by AND) is much harder to process and optimize. For example, consider OP4 :
OP4 : σDno=5 OR
Salary > 30000 OR Sex=‘F’ (EMPLOYEE)
With such a condition, little optimization can
be done, because the records satisfying the disjunctive condition are the union of the records satisfying the
individual conditions. Hence, if any one
of the conditions does not have an access path, we are compelled to use the
brute force, linear search approach. Only if an access path exists on every simple condition in the disjunction
can we optimize the selection by retrieving the records satisfying each
condition—or their record ids—and then applying the union operation to eliminate duplicates.
A DBMS will have available many of the methods discussed above, and typically many additional methods. The query optimizer must choose the appropriate one for executing each SELECT operation in a query. This optimization uses formulas that estimate the costs for each available access method, as we will discuss in Section 19.8. The optimizer chooses the access method with the lowest estimated cost.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2023 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.