An Overview of Database Tuning in Relational Systems
After a database is deployed and is in operation, actual use of the applications, transactions, queries, and views reveals factors and problem areas that may not have been accounted for during the initial physical design. The inputs to physical design listed in Section 20.1.1 can be revised by gathering actual statistics about usage patterns. Resource utilization as well as internal DBMS processing—such as query optimization—can be monitored to reveal bottlenecks, such as contention for the same data or devices. Volumes of activity and sizes of data can be better estimated. Therefore, it is necessary to monitor and revise the physical database design constantly—an activity referred to as database tuning. The goals of tuning are as follows:
To make applications run faster.
To improve (lower) the response time of queries and transactions.
To improve the overall throughput of transactions.
The dividing line between physical design and tuning is very thin. The same design decisions that we discussed in Section 20.1.2 are revisited during database tuning, which is a continual adjustment of the physical design. We give a brief overview of the tuning process below. The inputs to the tuning process include statistics related to the same factors mentioned in Section 20.1.1. In particular, DBMSs can internally collect the following statistics:
Sizes of individual tables.
Number of distinct values in a column.
The number of times a particular query or transaction is submitted and executed in an interval of time.
The times required for different phases of query and transaction processing (for a given set of queries or transactions).
These and other statistics create a profile of the contents and use of the database. Other information obtained from monitoring the database system activities and processes includes the following:
Storage statistics. Data about allocation of storage into tablespaces, indexspaces, and buffer pools.
I/O and device performance statistics. Total read/write activity (paging) on disk extents and disk hot spots.
Query/transaction processing statistics. Execution times of queries and transactions, and optimization times during query optimization.
Locking/logging related statistics. Rates of issuing different types of locks, transaction throughput rates, and log record activity.
Index statistics. Number of levels in an index, number of noncontiguous leaf pages, and so on.
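Such statistics are typically maintained in system catalogs that the DBA can inspect directly. As a small, vendor-neutral sketch (using SQLite through Python's sqlite3 module; the table and index names are made up for illustration), the ANALYZE command populates the sqlite_stat1 catalog with table cardinalities and per-index selectivity estimates of exactly the kind listed above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Ssn TEXT PRIMARY KEY, Dno INTEGER)")
con.execute("CREATE INDEX idx_dno ON EMPLOYEE (Dno)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [(str(i), i % 5) for i in range(1000)])

con.execute("ANALYZE")  # gather table and index statistics
# sqlite_stat1 holds one row per index: "<rows in table> <rows per distinct key>"
stats = {idx: stat for tbl, idx, stat in
         con.execute("SELECT tbl, idx, stat FROM sqlite_stat1")}
print(stats["idx_dno"])  # 1000 rows, 5 distinct Dno values, so about 200 each
```

Commercial systems expose much richer catalogs, but the principle is the same: the DBMS itself records table sizes and distinct-value counts for the optimizer and the DBA.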
Some of the above statistics relate to transactions, concurrency control, and recovery, which are discussed in Chapters 21 through 23. Tuning a database involves dealing with the following types of problems:
How to avoid excessive lock contention, thereby increasing concurrency among transactions.
How to minimize the overhead of logging and unnecessary dumping of data.
How to optimize the buffer size and scheduling of processes.
How to allocate resources such as disks, RAM, and processes for most efficient utilization.
Most of the previously mentioned problems can be addressed by the DBA by setting appropriate physical DBMS parameters, changing configurations of devices, changing operating system parameters, and other similar activities. The solutions tend to be closely tied to specific systems, and DBAs are typically trained to handle these tuning problems for the specific DBMS. We briefly discuss the tuning of various physical database design decisions below.
1. Tuning Indexes
The initial choice of indexes may have to be revised for the following reasons:
Certain queries may take too long to run for lack of an index.
Certain indexes may not get utilized at all.
Certain indexes may undergo too much updating because the index is on an attribute that undergoes frequent changes.
Most DBMSs have a command or trace facility, which can be used by the DBA to ask the system to show how a query was executed—what operations were performed in what order and what secondary access structures (indexes) were used. By analyzing these execution plans, it is possible to diagnose the causes of the above problems. Some indexes may be dropped and some new indexes may be created based on the tuning analysis.
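As a concrete sketch of such a facility (here SQLite's EXPLAIN QUERY PLAN, driven from Python's sqlite3 module; the table and index names are illustrative, not from any particular system discussed above), the plan output directly reveals whether a query performs a full scan or uses an index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Lname TEXT, Dno INTEGER)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                [(str(i), "Smith", i % 10) for i in range(500)])

def plan(sql):
    # Each plan row's last column is a human-readable step description
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

before = plan("SELECT * FROM EMPLOYEE WHERE Dno = 4")  # no index yet: full scan
con.execute("CREATE INDEX idx_emp_dno ON EMPLOYEE (Dno)")
after = plan("SELECT * FROM EMPLOYEE WHERE Dno = 4")   # same query now uses the index
print(before)
print(after)
```

Comparing such before/after plans is the basic workflow behind the diagnosis described above: if a frequent query still shows a scan step, an index may be missing; if an index never appears in any plan, it may be a candidate for dropping.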
The goal of tuning is to dynamically evaluate the requirements, which sometimes fluctuate seasonally or during different times of the month or week, and to reorganize the indexes and file organizations to yield the best overall performance. Dropping and building new indexes is an overhead that can be justified in terms of performance improvements. Updating of a table is generally suspended while an index is dropped or created; this loss of service must be accounted for. Besides dropping or creating indexes and changing from a nonclustered to a clustered index and vice versa, rebuilding the index may improve performance. Most RDBMSs use B+-trees for an index. If there are many deletions on the index key, index pages may contain wasted space, which can be reclaimed during a rebuild operation. Similarly, too many insertions may cause overflows in a clustered index that affect performance. Rebuilding a clustered index amounts to reorganizing the entire table ordered on that key.
The available options for indexing and the ways indexes are defined, created, and reorganized vary from system to system. As an illustration, consider the sparse and dense indexes in Chapter 18. A sparse index such as a primary index (see Section 18.1) will have one index pointer for each page (disk block) in the data file; a dense index such as a unique secondary index will have an index pointer for each record. Sybase provides clustering indexes as sparse indexes in the form of B+-trees, whereas INGRES provides sparse clustering indexes as ISAM files and dense clustering indexes as B+-trees. In some versions of Oracle and DB2, the option of setting up a clustering index is limited to a dense index (with many more index entries), and the DBA has to work with this limitation.
2. Tuning the Database Design
In Section 20.1.2, we discussed the need for a possible denormalization, which is a departure from keeping all tables as BCNF relations. If a given physical database design does not meet the expected objectives, the DBA may revert to the logical database design, make adjustments such as denormalizations to the logical schema, and remap it to a new set of physical tables and indexes.
As discussed, the entire database design has to be driven by the processing requirements as much as by data requirements. If the processing requirements are dynamically changing, the design needs to respond by making changes to the conceptual schema if necessary and to reflect those changes into the logical schema and physical design. These changes may be of the following nature:
Existing tables may be joined (denormalized) because certain attributes from two or more tables are frequently needed together. This reduces the normalization level from BCNF to 3NF, 2NF, or 1NF.
For the given set of tables, there may be alternative design choices, all of which achieve 3NF or BCNF. We illustrated alternative equivalent designs in Chapter 16. One normalized design may be replaced by another.
A relation of the form R(K, A, B, C, D, ...)—with K as a set of key attributes—that is in BCNF can be stored in multiple tables that are also in BCNF—for example, R1(K, A, B), R2(K, C, D), R3(K, ...)—by replicating the key K in each table. Such a process is known as vertical partitioning. Each table groups sets of attributes that are accessed together. For example, the table
EMPLOYEE(Ssn, Name, Phone, Grade, Salary) may be split into two tables: EMP1(Ssn, Name, Phone) and EMP2(Ssn, Grade, Salary). If the original table has a large number of rows (say 100,000) and queries about phone numbers and salary information are totally distinct and occur with very different fre-quencies, then this separation of tables may work better.
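The split can be sketched as follows (a minimal demonstration using SQLite through Python's sqlite3 module; a single sample row stands in for the large table):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE EMPLOYEE
               (Ssn TEXT PRIMARY KEY, Name TEXT, Phone TEXT,
                Grade INTEGER, Salary REAL)""")
con.execute("INSERT INTO EMPLOYEE VALUES ('123', 'Smith', '555-1234', 3, 40000)")

# Vertical partitioning: replicate the key Ssn in each fragment
con.execute("CREATE TABLE EMP1 AS SELECT Ssn, Name, Phone FROM EMPLOYEE")
con.execute("CREATE TABLE EMP2 AS SELECT Ssn, Grade, Salary FROM EMPLOYEE")

# Queries that need only contact data touch the narrower EMP1 ...
phone = con.execute("SELECT Phone FROM EMP1 WHERE Ssn = '123'").fetchone()[0]
# ... and the original row is recoverable by joining on the shared key
row = con.execute("""SELECT EMP1.Ssn, Name, Phone, Grade, Salary
                     FROM EMP1 JOIN EMP2 ON EMP1.Ssn = EMP2.Ssn""").fetchone()
print(phone, row)
```

The design trade-off is visible even in this sketch: single-fragment queries read narrower rows, while queries that span both fragments pay for a join.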
Attribute(s) from one table may be repeated in another even though this creates redundancy and a potential anomaly. For example, Part_name may be replicated in tables wherever the Part# appears (as foreign key), but there may be one master table called PART_MASTER(Part#, Part_name, ...) where the Part_name is guaranteed to be up-to-date.
Just as vertical partitioning splits a table vertically into multiple tables, horizontal partitioning takes horizontal slices of a table and stores them as distinct tables. For example, product sales data may be separated into ten tables based on ten product lines. Each table has the same set of columns (attributes) but contains a distinct set of products (tuples). If a query or transaction applies to all product data, it may have to run against all the tables and the results may have to be combined.
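A minimal sketch of this idea (SQLite through Python's sqlite3 module, with two made-up product lines standing in for ten):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Horizontal partitioning: one table per product line, identical columns
for line in ("TOYS", "TOOLS"):
    con.execute(f"CREATE TABLE SALES_{line} (Product TEXT, Amount REAL)")
con.execute("INSERT INTO SALES_TOYS VALUES ('kite', 10.0), ('ball', 5.0)")
con.execute("INSERT INTO SALES_TOOLS VALUES ('drill', 99.0)")

# A query about one product line touches only its (smaller) table
toys_total = con.execute("SELECT SUM(Amount) FROM SALES_TOYS").fetchone()[0]

# A query over all product data must combine the partitions
grand_total = con.execute("""SELECT SUM(Amount) FROM
    (SELECT Amount FROM SALES_TOYS
     UNION ALL
     SELECT Amount FROM SALES_TOOLS)""").fetchone()[0]
print(toys_total, grand_total)
```

This mirrors the trade-off stated above: per-partition queries get smaller tables, but queries over all products must run against every partition and combine the results.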
These types of adjustments designed to meet the high volume of queries or transactions, with or without sacrificing the normal forms, are commonplace in practice.
3. Tuning Queries
We have already discussed how query performance depends on the appropriate selection of indexes, and how indexes may have to be tuned after analyzing poorly performing queries with the RDBMS commands that display a query's execution plan. There are mainly two indications that query tuning may be needed:
A query issues too many disk accesses (for example, an exact match query scans an entire table).
The query plan shows that relevant indexes are not being used.
Some typical instances of situations prompting query tuning include the following:
Many query optimizers do not use indexes in the presence of arithmetic expressions (such as Salary/365 > 10.50), numerical comparisons of attributes of different sizes and precision (such as Aqty = Bqty where Aqty is of type INTEGER and Bqty is of type SMALLINTEGER), NULL comparisons (such as Bdate IS NULL), and substring comparisons (such as Lname LIKE ‘%mann’).
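The first of these effects is easy to demonstrate (a sketch using SQLite through Python's sqlite3 module; an equality variant of the Salary/365 example is used, and idx_salary is a made-up index name). Arithmetic applied to the indexed column forces a full scan, whereas moving the arithmetic to the constant side lets the optimizer use the index:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Salary REAL)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [(str(i), 1000.0 * i) for i in range(500)])
con.execute("CREATE INDEX idx_salary ON EMPLOYEE (Salary)")

def plan(sql):
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Arithmetic on the indexed column hides the index from the optimizer
p_expr = plan("SELECT Ssn FROM EMPLOYEE WHERE Salary / 365 = 10.50")
# Rewritten with the arithmetic on the constant side, the index applies
p_rewritten = plan("SELECT Ssn FROM EMPLOYEE WHERE Salary = 10.50 * 365")
print(p_expr)
print(p_rewritten)
```

The same rewriting tactic (keep the indexed column bare on one side of the comparison) applies in most RDBMSs, although exactly which predicates defeat an optimizer is system-specific.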
Indexes are often not used for nested queries using IN; for example, the following query:
SELECT Ssn FROM EMPLOYEE
WHERE Dno IN ( SELECT Dnumber FROM DEPARTMENT
WHERE Mgr_ssn = ‘333445555’ );
may not use the index on Dno in EMPLOYEE, whereas using Dno = Dnumber in the WHERE-clause with a single block query may cause the index to be used.
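The rewrite can be checked for equivalence on sample data (a sketch using SQLite through Python's sqlite3 module, with a few made-up tuples):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Dno INTEGER)")
con.execute("CREATE TABLE DEPARTMENT (Dnumber INTEGER, Mgr_ssn TEXT)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?)",
                [("111", 5), ("222", 5), ("333", 4)])
con.execute("INSERT INTO DEPARTMENT VALUES (5, '333445555'), (4, '999')")

# Nested form with IN, as in the text
nested = con.execute("""SELECT Ssn FROM EMPLOYEE
                        WHERE Dno IN (SELECT Dnumber FROM DEPARTMENT
                                      WHERE Mgr_ssn = '333445555')""").fetchall()
# Equivalent single-block form that gives the optimizer a join to work with
unnested = con.execute("""SELECT Ssn FROM EMPLOYEE, DEPARTMENT
                          WHERE Dno = Dnumber
                            AND Mgr_ssn = '333445555'""").fetchall()
print(sorted(nested), sorted(unnested))
```

The two forms return the same result here because Dnumber is a key of DEPARTMENT; in general, unnesting an IN into a join is safe only when the subquery cannot produce duplicate matches for a tuple.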
Some DISTINCTs may be redundant and can be eliminated without changing the result. A DISTINCT often causes a sort operation and should therefore be avoided whenever possible.
Unnecessary use of temporary result tables can be avoided by collapsing multiple queries into a single query unless the temporary relation is needed for some intermediate processing.
In some situations involving the use of correlated queries, temporaries are useful. Consider the following query, which retrieves the highest paid employee in each department:
SELECT Ssn
FROM EMPLOYEE E
WHERE Salary = ( SELECT MAX (Salary)
                 FROM EMPLOYEE AS M
                 WHERE M.Dno = E.Dno );
This query carries the potential danger of scanning the entire inner EMPLOYEE table M for each tuple of the outer EMPLOYEE table E. To make the execution more efficient, the process can be broken into two queries, where the first query computes the maximum salary in each department as follows:
SELECT MAX (Salary) AS High_salary, Dno INTO TEMP
FROM EMPLOYEE
GROUP BY Dno;

SELECT EMPLOYEE.Ssn
FROM EMPLOYEE, TEMP
WHERE EMPLOYEE.Salary = TEMP.High_salary
AND EMPLOYEE.Dno = TEMP.Dno;
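Both forms can be run side by side as a sketch (SQLite through Python's sqlite3 module; SQLite has no SELECT ... INTO TEMP syntax, so CREATE TEMP TABLE ... AS is used instead, and the temporary table is named DEPT_MAX here to avoid the TEMP keyword):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Ssn TEXT, Dno INTEGER, Salary REAL)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?)",
                [("1", 4, 30000), ("2", 4, 45000),
                 ("3", 5, 38000), ("4", 5, 25000)])

# Correlated form: the inner block conceptually runs once per outer tuple
correlated = con.execute("""SELECT Ssn FROM EMPLOYEE E
    WHERE Salary = (SELECT MAX(Salary) FROM EMPLOYEE AS M
                    WHERE M.Dno = E.Dno)""").fetchall()

# Decomposed form: materialize the per-department maxima once, then join
con.execute("""CREATE TEMP TABLE DEPT_MAX AS
    SELECT MAX(Salary) AS High_salary, Dno FROM EMPLOYEE GROUP BY Dno""")
decomposed = con.execute("""SELECT Ssn FROM EMPLOYEE, DEPT_MAX
    WHERE EMPLOYEE.Salary = DEPT_MAX.High_salary
      AND EMPLOYEE.Dno = DEPT_MAX.Dno""").fetchall()
print(sorted(correlated), sorted(decomposed))
```

The decomposed form computes each department's maximum exactly once instead of potentially once per employee, which is the efficiency argument made above.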
If multiple options for a join condition are possible, choose one that uses a clustering index and avoid those that contain string comparisons. For example, assuming that the Name attribute is a candidate key in EMPLOYEE and STUDENT, it is better to use EMPLOYEE.Ssn = STUDENT.Ssn as a join condition rather than EMPLOYEE.Name = STUDENT.Name if Ssn has a clustering index in one or both tables.
One idiosyncrasy with some query optimizers is that the order of tables in the FROM-clause may affect the join processing. If that is the case, one may have to switch this order so that the smaller of the two relations is scanned and the larger relation is used with an appropriate index.
Some query optimizers perform worse on nested queries compared to their equivalent unnested counterparts. There are four types of nested queries:
Uncorrelated subqueries with aggregates in an inner query.
Uncorrelated subqueries without aggregates.
Correlated subqueries with aggregates in an inner query.
Correlated subqueries without aggregates.
Of the four types above, the first one typically presents no problem, since most query optimizers evaluate the inner query once. However, for a query of the second type, such as the example in item 2, most query optimizers may not use an index on Dno in EMPLOYEE. However, the same optimizers may do so if the query is written as an unnested query. Transformation of correlated subqueries may involve setting up temporary tables. Detailed examples are outside our scope here.
Finally, many applications are based on views that define the data of interest to those applications. Sometimes these views become overkill: a query that could be posed directly against a base table is instead forced through a view defined by a JOIN.
4. Additional Query Tuning Guidelines
Additional techniques for improving queries apply in certain situations as follows:
A query with multiple selection conditions that are connected via OR may not be prompting the query optimizer to use any index. Such a query may be split up and expressed as a union of queries, each with a condition on an attribute that causes an index to be used. For example,
SELECT Fname, Lname, Salary, Age
FROM EMPLOYEE
WHERE Age > 45 OR Salary < 50000;
may be executed using sequential scan giving poor performance. Splitting it up as
SELECT Fname, Lname, Salary, Age
FROM EMPLOYEE
WHERE Age > 45
UNION
SELECT Fname, Lname, Salary, Age
FROM EMPLOYEE
WHERE Salary < 50000;
may utilize indexes on Age as well as on Salary.
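The two forms are equivalent, which can be verified on sample data (a sketch using SQLite through Python's sqlite3 module; note that UNION, unlike UNION ALL, removes the duplicates that would otherwise arise from a row satisfying both conditions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE EMPLOYEE (Fname TEXT, Lname TEXT, Salary REAL, Age INTEGER)")
con.executemany("INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?)",
                [("Al", "Ash", 60000, 50), ("Bo", "Bell", 40000, 30),
                 ("Cy", "Cole", 70000, 40)])
con.execute("CREATE INDEX idx_age ON EMPLOYEE (Age)")
con.execute("CREATE INDEX idx_sal ON EMPLOYEE (Salary)")

# Original form with OR, as in the text
with_or = con.execute("""SELECT Fname FROM EMPLOYEE
                         WHERE Age > 45 OR Salary < 50000""").fetchall()
# Split into a union, each branch with a single indexable condition
as_union = con.execute("""SELECT Fname FROM EMPLOYEE WHERE Age > 45
                          UNION
                          SELECT Fname FROM EMPLOYEE WHERE Salary < 50000""").fetchall()
print(sorted(with_or), sorted(as_union))
```

Whether the split actually helps depends on the optimizer: some systems (including SQLite) can already decompose an OR over indexed columns internally, while others fall back to a sequential scan and benefit from the manual rewrite.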
To help expedite a query, the following transformations may be tried:
NOT condition may be transformed into a positive expression.
Embedded SELECT blocks using IN, = ALL, = ANY, and = SOME may be replaced by joins.
If an equality join is set up between two tables, the range predicate (selection condition) on the joining attribute set up in one table may be repeated for the other table.
WHERE conditions may be rewritten to utilize the indexes on multiple columns. For example,
SELECT Region#, Prod_type, Month, Sales
WHERE Region# = 3 AND ((Prod_type BETWEEN 1 AND 3) OR (Prod_type
BETWEEN 8 AND 10));
may use an index only on Region# and search through all leaf pages of the index for a match on Prod_type. Instead, using
SELECT Region#, Prod_type, Month, Sales
WHERE (Region# = 3 AND (Prod_type BETWEEN 1 AND 3))
OR (Region# = 3 AND (Prod_type BETWEEN 8 AND 10));
may use a composite index on (Region#, Prod_type) and work much more efficiently.
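As a sketch of the effect (SQLite through Python's sqlite3 module; the '#' in Region# is dropped because it is not legal in an unquoted SQL identifier, and idx_region_prod is a made-up composite index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE SALES
               (Region INTEGER, Prod_type INTEGER, Month INTEGER, Amount REAL)""")
con.executemany("INSERT INTO SALES VALUES (?, ?, ?, ?)",
                [(r, p, 1, 100.0) for r in range(10) for p in range(1, 11)])
con.execute("CREATE INDEX idx_region_prod ON SALES (Region, Prod_type)")

def plan(sql):
    return " | ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Rewritten form: each OR branch constrains BOTH columns of the composite index
rewritten = """SELECT * FROM SALES
               WHERE (Region = 3 AND Prod_type BETWEEN 1 AND 3)
                  OR (Region = 3 AND Prod_type BETWEEN 8 AND 10)"""
rows = con.execute(rewritten).fetchall()
p = plan(rewritten)
print(len(rows))
print(p)
```

In the rewritten form each branch supplies an equality on the leading index column plus a range on the second, so the optimizer can search the composite index rather than filtering Prod_type over all entries with Region = 3.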
In this section, we have covered many of the common situations in which an inefficient query can be fixed by a simple corrective action, such as using a temporary table, avoiding certain query constructs, or bypassing an unnecessary view. The goal is to have the RDBMS use existing single-attribute or composite-attribute indexes as much as possible, avoiding full scans of data blocks and scans of entire index leaf levels. Redundant operations such as sorting should be avoided wherever possible. The specific problems and remedies depend on the workings of the query optimizer within each RDBMS. RDBMS vendors publish detailed tuning guidelines for database administrators, and major vendors such as Oracle, IBM, and Microsoft encourage their large customers to share tuning ideas at annual expos and other forums so that the entire industry benefits from proven performance enhancement techniques. These techniques are typically available in the trade literature and on various Web sites.