Algorithms for SELECT and JOIN Operations
Implementing the SELECT Operation
There are many algorithms for executing a SELECT operation, which is basically a search operation to locate the records in a disk file that satisfy a certain condition. Some of the search algorithms depend on the file having specific access paths, and they may apply only to certain types of selection conditions. We discuss some of the algorithms for implementing SELECT in this section. We will use the following operations, specified on the relational database in Figure 3.5, to illustrate our dis-cussion:
OP1: σSsn = ‘123456789’ (EMPLOYEE)
OP2: σDnumber > 5 (DEPARTMENT)
OP3: σDno = 5 (EMPLOYEE)
OP4: σDno = 5 AND Salary > 30000 AND Sex = ‘F’ (EMPLOYEE)
OP5: σEssn=‘123456789’ AND Pno =10(WORKS_ON)
Search Methods for Simple Selection. A number of search algorithms are pos-sible for selecting records from a file. These are also known as file scans, because they scan the records of a file to search for and retrieve records that satisfy a selec-tion condition.4 If the search algorithm involves the use of an index, the index search is called an index scan. The following search methods (S1 through S6) are examples of some of the search algorithms that can be used to implement a select operation:
S1—Linear search (brute force algorithm). Retrieve every record in the file, and test whether its attribute values satisfy the selection condition. Since the records are grouped into disk blocks, each disk block is read into a main memory buffer, and then a search through the records within the disk block is conducted in main memory.
S2—Binary search. If the selection condition involves an equality compari-son on a key attribute on which the file is ordered, binary search—which is
more efficient than linear search—can be used. An example is OP1 if Ssn is the ordering attribute for the EMPLOYEE file.5
S3a—Using a primary index. If the selection condition involves an equality comparison on a key attribute with a primary index—for example, Ssn = ‘123456789’ in OP1—use the primary index to retrieve the record. Note that this condition retrieves a single record (at most).
S3b—Using a hash key. If the selection condition involves an equality com-parison on a key attribute with a hash key—for example, Ssn = ‘123456789’ in OP1—use the hash key to retrieve the record. Note that this condition retrieves a single record (at most).
S4—Using a primary index to retrieve multiple records. If the comparison condition is >, >=, <, or <= on a key field with a primary index—for exam-ple, Dnumber > 5 in OP2—use the index to find the record satisfying the cor-responding equality condition (Dnumber = 5), then retrieve all subsequent records in the (ordered) file. For the condition Dnumber < 5, retrieve all the preceding records.
S5—Using a clustering index to retrieve multiple records. If the selection condition involves an equality comparison on a nonkey attribute with a clustering index—for example, Dno = 5 in OP3—use the index to retrieve all the records satisfying the condition.
S6—Using a secondary (B+-tree) index on an equality comparison. This search method can be used to retrieve a single record if the indexing field is a key (has unique values) or to retrieve multiple records if the indexing field is not a key. This can also be used for comparisons involving >, >=, <, or <=.
In Section 19.8, we discuss how to develop formulas that estimate the access cost of these search methods in terms of the number of block accesses and access time. Method S1 (linear search) applies to any file, but all the other methods depend on having the appropriate access path on the attribute used in the selection condition. Method S2 (binary search) requires the file to be sorted on the search attribute. The methods that use an index (S3a, S4, S5, and S6) are generally referred to as index searches, and they require the appropriate index to exist on the search attribute. Methods S4 and S6 can be used to retrieve records in a certain range—for example, 30000 <= Salary <= 35000. Queries involving such conditions are called range queries.
Search Methods for Complex Selection. If a condition of a SELECT operation is a conjunctive condition—that is, if it is made up of several simple conditions connected with the AND logical connective such as OP4 above—the DBMS can use the following additional methods to implement the operation:
S7—Conjunctive selection using an individual index. If an attribute involved in any single simple condition in the conjunctive select condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive select condition.
S8—Conjunctive selection using a composite index. If two or more attrib-utes are involved in equality conditions in the conjunctive select condition and a composite index (or hash structure) exists on the combined fields— for example, if an index has been created on the composite key (Essn, Pno) of the WORKS_ON file for OP5—we can use the index directly.
S9—Conjunctive selection by intersection of record pointers.6 If second-ary indexes (or other access paths) are available on more than one of the fields involved in simple conditions in the conjunctive select condition, and if the indexes include record pointers (rather than block pointers), then each index can be used to retrieve the set of record pointers that satisfy the indi-vidual condition. The intersection of these sets of record pointers gives the record pointers that satisfy the conjunctive select condition, which are then used to retrieve those records directly. If only some of the conditions have
secondary indexes, each retrieved record is further tested to determine whether it satisfies the remaining conditions.7 In general, method S9 assumes that each of the indexes is on a nonkey field of the file, because if one of the conditions is an equality condition on a key field, only one record will satisfy the whole condition.
Whenever a single condition specifies the selection—such as OP1, OP2, or OP3— the DBMS can only check whether or not an access path exists on the attribute involved in that condition. If an access path (such as index or hash key or sorted file) exists, the method corresponding to that access path is used; otherwise, the brute force, linear search approach of method S1 can be used. Query optimization for a SELECT operation is needed mostly for conjunctive select conditions whenever more than one of the attributes involved in the conditions have an access path. The optimizer should choose the access path that retrieves the fewest records in the most efficient way by estimating the different costs (see Section 19.8) and choosing the method with the least estimated cost.
Selectivity of a Condition. When the optimizer is choosing between multiple simple conditions in a conjunctive select condition, it typically considers the selectivity of each condition. The selectivity (sl) is defined as the ratio of the num-ber of records (tuples) that satisfy the condition to the total number of records (tuples) in the file (relation), and thus is a number between zero and one. Zero selec-tivity means none of the records in the file satisfies the selection condition, and a selectivity of one means that all the records in the file satisfy the condition. In gen-eral, the selectivity will not be either of these two extremes, but will be a fraction that estimates the percentage of file records that will be retrieved.
Although exact selectivities of all conditions may not be available, estimates of selectivities are often kept in the DBMS catalog and are used by the optimizer. For example, for an equality condition on a key attribute of relation r(R), s = 1/|r(R)|, where |r(R)| is the number of tuples in relation r(R). For an equality condition on a nonkey attribute with i distinct values, s can be estimated by (|r(R)|/i)/|r(R)| or 1/i, assuming that the records are evenly or uniformly distributed among the distinct values. Under this assumption, |r(R)|/i records will satisfy an equality condition on this attribute. In general, the number of records satisfying a selection condition with selectivity sl is estimated to be |r(R)| * sl. The smaller this estimate is, the higher the desirability of using that condition first to retrieve records. In certain cases, the actual distribution of records among the various distinct values of the attribute is kept by the DBMS in the form of a histogram, in order to get more accurate esti-mates of the number of records that satisfy a particular condition.
Disjunctive Selection Conditions. Compared to a conjunctive selection condition, a disjunctive condition (where simple conditions are connected by the OR logical connective rather than by AND) is much harder to process and optimize. For example, consider OP4 :
OP4 : σDno=5 OR Salary > 30000 OR Sex=‘F’ (EMPLOYEE)
With such a condition, little optimization can be done, because the records satisfying the disjunctive condition are the union of the records satisfying the individual conditions. Hence, if any one of the conditions does not have an access path, we are compelled to use the brute force, linear search approach. Only if an access path exists on every simple condition in the disjunction can we optimize the selection by retrieving the records satisfying each condition—or their record ids—and then applying the union operation to eliminate duplicates.
A DBMS will have available many of the methods discussed above, and typically many additional methods. The query optimizer must choose the appropriate one for executing each SELECT operation in a query. This optimization uses formulas that estimate the costs for each available access method, as we will discuss in Section 19.8. The optimizer chooses the access method with the lowest estimated cost.
Copyright © 2018-2020 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.