Implementing Aggregate Operations and OUTER JOINs
1. Implementing Aggregate Operations
The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an entire table, can be computed by a table scan or by using an appropriate index, if available. For example, consider the following SQL query:
If an (ascending) B+-tree index on Salary exists for the EMPLOYEE relation, then the optimizer can decide on using the Salary index to search for the largest Salary value in the index by following the rightmost pointer in each index node from the root to the rightmost leaf. That node would include the largest Salary value as its last entry. In most cases, this would be more efficient than a full table scan of EMPLOYEE, since no actual records need to be retrieved. The MIN function can be handled in a similar manner, except that the leftmost pointer in the index is followed from the root to leftmost leaf. That node would include the smallest Salary value as its first entry.
The index could also be used for the AVERAGE and SUM aggregate functions, but only if it is a dense index—that is, if there is an index entry for every record in the main file. In this case, the associated computation would be applied to the values in the index. For a nondense index, the actual number of records associated with each index value must be used for a correct computation. This can be done if the number of records associated with each value in the index is stored in each index entry. For the COUNT aggregate function, the number of values can be also computed from the index in a similar manner. If a COUNT(*) function is applied to a whole relation, the number of records currently in each relation are typically stored in the catalog, and so the result can be retrieved directly from the catalog.
When a GROUP BY clause is used in a query, the aggregate operator must be applied separately to each group of tuples as partitioned by the grouping attribute. Hence, the table must first be partitioned into subsets of tuples, where each partition (group) has the same value for the grouping attributes. In this case, the computation is more complex. Consider the following query:
SELECT Dno, AVG(Salary)
GROUP BY Dno;
The usual technique for such queries is to first use either sorting or hashing on the grouping attributes to partition the file into the appropriate groups. Then the algorithm computes the aggregate function for the tuples in each group, which have the same grouping attribute(s) value. In the sample query, the set of EMPLOYEE tuples for each department number would be grouped together in a partition and the aver-age salary computed for each group.
Notice that if a clustering index (see Chapter 18) exists on the grouping attribute(s), then the records are already partitioned (grouped) into the appropriate subsets. In this case, it is only necessary to apply the computation to each group.
2. Implementing OUTER JOINs
In Section 6.4, the outer join operation was discussed, with its three variations: left outer join, right outer join, and full outer join. We also discussed in Chapter 5 how these operations can be specified in SQL. The following is an example of a left outer join operation in SQL:
SELECT Lname, Fname, Dname
FROM (EMPLOYEE LEFT OUTER JOIN DEPARTMENT ON Dno=Dnumber);
The result of this query is a table of employee names and their associated departments. It is similar to a regular (inner) join result, with the exception that if an
EMPLOYEE tuple (a tuple in the left relation) does not have an associated department, the employee’s name will still appear in the resulting table, but the department name would be NULL for such tuples in the query result.
Outer join can be computed by modifying one of the join algorithms, such as nested-loop join or single-loop join. For example, to compute a left outer join, we use the left relation as the outer loop or single-loop because every tuple in the left relation must appear in the result. If there are matching tuples in the other relation, the joined tuples are produced and saved in the result. However, if no matching tuple is found, the tuple is still included in the result but is padded with NULL value(s). The sort-merge and hash-join algorithms can also be extended to compute outer joins.
Theoretically, outer join can also be computed by executing a combination of relational algebra operators. For example, the left outer join operation shown above is equivalent to the following sequence of relational operations:
1. Compute the (inner) JOIN of the EMPLOYEE and DEPARTMENT tables.
TEMP1 ← πLname, Fname, Dname
2. Find the EMPLOYEE tuples that do not appear in the (inner) JOIN result.
TEMP2 ← πLname, Fname (EMPLOYEE) – πLname, Fname (TEMP1)
3. Pad each tuple in TEMP2 with a NULL Dname field.
TEMP2 ← TEMP2 × NULL
Apply the UNION operation to TEMP1, TEMP2 to produce the LEFT OUTER JOIN result.
RESULT ← TEMP1 ∪ TEMP2
The cost of the outer join as computed above would be the sum of the costs of the associated steps (inner join, projections, set difference, and union). However, note that step 3 can be done as the temporary relation is being constructed in step 2; that is, we can simply pad each resulting tuple with a NULL. In addition, in step 4, we know that the two operands of the union are disjoint (no common tuples), so there is no need for duplicate elimination.