Building a Data Warehouse
In constructing a data warehouse, builders should take a broad view of the anticipated use of the warehouse. There is no way to anticipate all possible queries or analyses during the design phase. However, the design should specifically support ad-hoc querying, that is, accessing data with any meaningful combination of values for the attributes in the dimension or fact tables. For example, a marketing-intensive consumer-products company would require different ways of organizing the data warehouse than would a nonprofit charity focused on fund raising. An appropriate schema should be chosen that reflects anticipated usage.
Acquisition of data for the warehouse involves the following steps:
The data must be extracted from multiple, heterogeneous sources, for example, databases or other data feeds such as those containing financial market data or environmental data.
Data must be formatted for consistency within the warehouse. Names, meanings, and domains of data from unrelated sources must be reconciled. For instance, subsidiary companies of a large corporation may have different fiscal calendars with quarters ending on different dates, making it difficult to aggregate financial data by quarter. Various credit cards may report their transactions differently, making it difficult to compute all credit sales. These format inconsistencies must be resolved.
The data must be cleaned to ensure validity. Data cleaning is an involved and complex process that has been identified as the largest labor-demanding component of data warehouse construction. For input data, cleaning must occur before the data is loaded into the warehouse. There is nothing about cleaning data that is specific to data warehousing and that could not be applied to a host database. However, since input data must be examined and formatted consistently, data warehouse builders should take this opportunity to check for validity and quality. Recognizing erroneous and incomplete data is difficult to automate, and cleaning that requires automatic error correction can be even tougher. Some aspects, such as domain checking, are easily coded into data cleaning routines, but automatic recognition of other data problems can be more challenging. (For example, one might require that City = ‘San Francisco’ together with State = ‘CT’ be recognized as an incorrect combination.) After such problems have been taken care of, similar data from different sources must be coordinated for loading into the ware-house. As data managers in the organization discover that their data is being cleaned for input into the warehouse, they will likely want to upgrade their data with the cleaned data. The process of returning cleaned data to the source is called backflushing (see Figure 29.1).
The data must be fitted into the data model of the warehouse. Data from the various sources must be installed in the data model of the warehouse. Data may have to be converted from relational, object-oriented, or legacy data-bases (network and/or hierarchical) to a multidimensional model.
The data must be loaded into the warehouse. The sheer volume of data in the warehouse makes loading the data a significant task. Monitoring tools for loads as well as methods to recover from incomplete or incorrect loads are required. With the huge volume of data in the warehouse, incremental updating is usually the only feasible approach. The refresh policy will probably emerge as a compromise that takes into account the answers to the fol-lowing questions:
· How up-to-date must the data be?
· Can the warehouse go offline, and for how long?
· What are the data interdependencies?
· What is the storage availability?
· What are the distribution requirements (such as for replication and parti-tioning)?
· What is the loading time (including cleaning, formatting, copying, trans-mitting, and overhead such as index rebuilding)?
As we have said, databases must strike a balance between efficiency in transaction processing and supporting query requirements (ad hoc user requests), but a data warehouse is typically optimized for access from a decision maker’s needs. Data storage in a data warehouse reflects this specialization and involves the following processes:
· Storing the data according to the data model of the warehouse
· Creating and maintaining required data structures
· Creating and maintaining appropriate access paths
· Providing for time-variant data as new data are added
· Supporting the updating of warehouse data
· Refreshing the data
· Purging data
Although adequate time can be devoted initially to constructing the warehouse, the sheer volume of data in the warehouse generally makes it impossible to simply reload the warehouse in its entirety later on. Alternatives include selective (partial) refreshing of data and separate warehouse versions (requiring double storage capacity for the warehouse!). When the warehouse uses an incremental data refreshing mechanism, data may need to be periodically purged; for example, a warehouse that maintains data on the previous twelve business quarters may periodically purge its data each year.
Data warehouses must also be designed with full consideration of the environment in which they will reside. Important design considerations include the following:
· Usage projections
· The fit of the data model
· Characteristics of available sources
· Design of the metadata component
· Modular component design
· Design for manageability and change
· Considerations of distributed and parallel architecture
We discuss each of these in turn. Warehouse design is initially driven by usage projections; that is, by expectations about who will use the warehouse and how they will use it. Choice of a data model to support this usage is a key initial decision. Usage projections and the characteristics of the warehouse’s data sources are both taken into account. Modular design is a practical necessity to allow the warehouse to evolve with the organization and its information environment. Additionally, a well built data warehouse must be designed for maintainability, enabling the warehouse managers to plan for and manage change effectively while providing optimal sup-port to users.
You may recall the term metadata from Chapter 1; metadata was defined as the description of a database including its schema definition. The metadata repository is a key data warehouse component. The metadata repository includes both technical and business metadata. The first, technical metadata, covers details of acquisition processing, storage structures, data descriptions, warehouse operations and maintenance, and access support functionality. The second, business metadata, includes the relevant business rules and organizational details supporting the warehouse.
The architecture of the organization’s distributed computing environment is a major determining characteristic for the design of the warehouse.
There are two basic distributed architectures: the distributed warehouse and the federated warehouse. For a distributed warehouse, all the issues of distributed databases are relevant, for example, replication, partitioning, communications, and consistency concerns. A distributed architecture can provide benefits particularly important to warehouse performance, such as improved load balancing, scalability of performance, and higher availability. A single replicated metadata repository would reside at each distribution site. The idea of the federated warehouse is like that of the federated database: a decentralized confederation of autonomous data warehouses, each with its own metadata repository. Given the magnitude of the challenge inherent to data warehouses, it is likely that such federations will consist of smaller scale components, such as data marts. Large organizations may choose to federate data marts rather than build huge data warehouses.