Characteristics of Data Warehouses
To discuss data warehouses and distinguish them from transactional databases calls for an appropriate data model. The multidimensional data model (explained in more detail in Section 29.3) is a good fit for OLAP and decision-support technologies. In contrast to multidatabases, which provide access to disjoint and usually heterogeneous databases, a data warehouse is frequently a store of integrated data from multiple sources, processed for storage in a multidimensional model. Unlike most transactional databases, data warehouses typically support time-series and trend analysis, both of which require more historical data than is generally maintained in transactional databases.
Compared with transactional databases, data warehouses are nonvolatile. This means that information in the data warehouse changes far less often and may be regarded as non–real-time with periodic updating. In transactional systems, transactions are the unit and are the agent of change to the database; by contrast, data ware-house information is much more coarse-grained and is refreshed according to a careful choice of refresh policy, usually incremental. Warehouse updates are handled by the warehouse’s acquisition component that provides all required preprocessing.
We can also describe data warehousing more generally as a collection of decision sup-port technologies, aimed at enabling the knowledge worker (executive, manager, analyst) to make better and faster decisions. Figure 29.1 gives an overview of the conceptual structure of a data warehouse. It shows the entire data warehousing process, which includes possible cleaning and reformatting of data before loading it into the warehouse. This process is handled by tools known as ETL (extraction, transformation, and loading) tools. At the back end of the process, OLAP, data mining, and DSS may generate new relevant information such as rules; this information is shown in the figure going back into the warehouse. The figure also shows that data sources may include files.
Data warehouses have the following distinctive characteristics:
Multidimensional conceptual view
Unlimited dimensions and aggregation levels
Unrestricted cross-dimensional operations
Dynamic sparse matrix handling
Intuitive data manipulation
Consistent reporting performance
Because they encompass large volumes of data, data warehouses are generally an order of magnitude (sometimes two orders of magnitude) larger than the source databases. The sheer volume of data (likely to be in terabytes or even petabytes) is an issue that has been dealt with through enterprise-wide data warehouses, virtual data warehouses, and data marts:
Enterprise-wide data warehouses are huge projects requiring massive investment of time and resources.
Virtual data warehouses provide views of operational databases that are materialized for efficient access.
Data marts generally are targeted to a subset of the organization, such as a department, and are more tightly focused.