Building a Data Warehouse
In constructing a data warehouse, builders should take a broad view of
the anticipated use of the warehouse. There is no way to anticipate all
possible queries or analyses during the design phase. However, the design
should specifically support ad hoc
querying, that is, accessing data with any meaningful combination of values for the attributes in the dimension or
fact tables. For example, a marketing-intensive consumer-products company would
require different ways of organizing the data warehouse than would a nonprofit
charity focused on fundraising. An appropriate schema should be chosen that
reflects anticipated usage.
Acquisition of data for the warehouse involves the following steps:
The data must be extracted from
multiple, heterogeneous sources, for example, databases or other data feeds
such as those containing financial market data or environmental data.
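For instance, a minimal extraction sketch in Python might pull rows from a relational source and from a flat-file feed into one uniform record format; the table, column, and file names here are hypothetical:

import csv
import sqlite3

def extract_from_database(conn: sqlite3.Connection):
    """Pull raw rows from a relational source (table/columns hypothetical)."""
    for sale_id, amount, sale_date in conn.execute(
            "SELECT sale_id, amount, sale_date FROM sales"):
        yield {"id": sale_id, "amount": amount, "date": sale_date, "source": "erp"}

def extract_from_feed(path: str):
    """Pull records from a flat-file feed, e.g., a daily market-data CSV."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield {"id": row["id"], "amount": row["amount"],
                   "date": row["date"], "source": "market_feed"}

Each extractor yields records in the same shape, so downstream formatting and cleaning steps need not know which source a record came from.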
Data must be formatted for
consistency within the warehouse. Names, meanings, and domains of data from
unrelated sources must be reconciled. For instance, subsidiary companies of a
large corporation may have different fiscal calendars with quarters ending on
different dates, making it difficult to aggregate financial data by quarter.
Various credit cards may report their transactions differently, making it difficult
to compute all credit sales. These format inconsistencies must be resolved.
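To illustrate the fiscal-calendar problem, the following sketch maps each subsidiary's transaction dates onto a common fiscal calendar; the subsidiaries and their year-end months are hypothetical:

from datetime import date

# Hypothetical fiscal year-end months for two subsidiaries; the same calendar
# date can fall in different fiscal quarters at each source.
FISCAL_YEAR_END_MONTH = {"subsidiary_a": 12, "subsidiary_b": 6}

def to_common_quarter(txn_date: date, subsidiary: str) -> str:
    """Map a transaction date onto a common warehouse fiscal quarter."""
    offset = FISCAL_YEAR_END_MONTH[subsidiary] % 12   # months after year end
    month_in_fy = (txn_date.month - offset - 1) % 12  # 0..11 within fiscal year
    return "Q%d" % (month_in_fy // 3 + 1)

print(to_common_quarter(date(2024, 3, 15), "subsidiary_a"))  # Q1 (calendar year)
print(to_common_quarter(date(2024, 3, 15), "subsidiary_b"))  # Q3 (June year end)

Once every source's dates are expressed in the warehouse's common quarters, financial data can be aggregated by quarter without further reconciliation.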
The data must be cleaned to
ensure validity. Data cleaning is an involved and complex process that has been
identified as the most labor-intensive component of data warehouse
construction. For input data, cleaning must occur before the data is loaded
into the warehouse. There is nothing about cleaning data that is specific to
data warehousing and that could not be applied to a host database. However,
since input data must be examined and formatted consistently, data warehouse
builders should take this opportunity to check for validity and quality.
Recognizing erroneous and incomplete data is difficult to automate, and
automatic correction of the errors found can be even tougher. Some
aspects, such as domain checking, are easily coded into data cleaning
routines, but automatic recognition of other data problems can be more
challenging. (For example, one might require that City = ‘San Francisco’ together with State = ‘CT’
be recognized as an incorrect combination.)
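A minimal cleaning routine along these lines might combine simple domain checks with cross-field checks like the city/state example; the reference sets below are abbreviated, hypothetical stand-ins for real lookup tables:

VALID_STATES = {"CA", "CT", "NY"}                      # abbreviated domain
VALID_CITY_STATE = {("San Francisco", "CA"),           # known-good pairs
                    ("Hartford", "CT")}

def check_record(record: dict) -> list:
    """Return a list of validity problems found in one input record."""
    problems = []
    city, state = record.get("city"), record.get("state")
    if state not in VALID_STATES:                      # domain checking
        problems.append("state %r outside domain" % state)
    elif (city, state) not in VALID_CITY_STATE:        # cross-field checking
        problems.append("city/state pair %r/%r not recognized" % (city, state))
    return problems

print(check_record({"city": "San Francisco", "state": "CT"}))
# -> ["city/state pair 'San Francisco'/'CT' not recognized"]

Here each value is individually valid, but the combination is flagged, which is exactly the kind of problem that is harder to recognize and correct automatically.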
After such problems have been taken care of, similar data from different
sources must be coordinated for loading into the warehouse. As data managers
in the organization discover that their data is being cleaned for input into
the warehouse, they will likely want to upgrade their data with the cleaned
data. The process of returning cleaned data to the source is called
backflushing (see Figure 29.1).
The data must be fitted into the
data model of the warehouse. Data from the various sources must be installed
in the warehouse's data model; it may have to be converted from relational,
object-oriented, or legacy databases (network and/or hierarchical) to a
multidimensional model.
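As a sketch of such a conversion, the following fragment splits flat relational rows into a fact table and a product dimension, the core move in building a multidimensional (star-schema) model; all column names are hypothetical:

def to_star_schema(rows):
    """Split flat rows into a fact table plus a product dimension table."""
    product_dim = {}   # product_id -> one dimension row per product
    fact_table = []
    for row in rows:
        key = row["product_id"]
        # Descriptive attributes are factored out once per product ...
        product_dim.setdefault(key, {"product_id": key,
                                     "name": row["product_name"]})
        # ... while the fact table keeps only foreign keys and measures.
        fact_table.append({"product_key": key,
                           "date_key": row["date"],
                           "amount": row["amount"]})
    return fact_table, list(product_dim.values())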
The data must be loaded into the
warehouse. The sheer volume of data in the warehouse makes loading the data a
significant task. Monitoring tools for loads, as well as methods to recover
from incomplete or incorrect loads, are required. With the huge volume of data
in the warehouse, incremental updating is usually the only feasible approach;
a sketch of one such load follows the list below. The refresh policy will
probably emerge as a compromise that takes into account the answers to the
following questions:
· How up-to-date must the data be?
· Can the warehouse go offline, and for how long?
· What are the data interdependencies?
· What is the storage availability?
· What are the distribution requirements (such as for replication and partitioning)?
· What is the loading time (including cleaning, formatting, copying, transmitting, and overhead such as index rebuilding)?
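The sketch below shows one way an incremental load might work, assuming the source assigns monotonically increasing transaction IDs and the warehouse keeps a load log as a high-water mark; all table and column names are hypothetical:

import sqlite3

def incremental_load(source: sqlite3.Connection,
                     warehouse: sqlite3.Connection) -> int:
    """Copy only rows added since the last successful load; return row count."""
    (mark,) = warehouse.execute(
        "SELECT COALESCE(MAX(loaded_through), 0) FROM load_log").fetchone()
    rows = source.execute(
        "SELECT txn_id, amount, txn_time FROM transactions "
        "WHERE txn_id > ? ORDER BY txn_id", (mark,)).fetchall()
    try:
        warehouse.executemany(
            "INSERT INTO sales_fact (txn_id, amount, txn_time) "
            "VALUES (?, ?, ?)", rows)
        if rows:                       # advance the high-water mark
            warehouse.execute(
                "INSERT INTO load_log (loaded_through) VALUES (?)",
                (rows[-1][0],))
        warehouse.commit()             # the batch becomes visible atomically
    except Exception:
        warehouse.rollback()           # recover from an incomplete load
        raise
    return len(rows)

Because the whole batch commits or rolls back as a unit, an interrupted load leaves the warehouse unchanged and can simply be rerun.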
As we have said, databases must strike a balance between efficiency in
transaction processing and supporting query requirements (ad hoc user
requests), but a data warehouse is typically optimized for access according to
a decision maker's needs. Data storage in a data warehouse reflects this
specialization and involves the following processes (one of them, providing
for time-variant data, is sketched after the list):
· Storing the data according to the data model of the warehouse
· Creating and maintaining required data structures
· Creating and maintaining appropriate access paths
· Providing for time-variant data as new data are added
· Supporting the updating of warehouse data
· Refreshing the data
· Purging data
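As one illustration, providing for time-variant data is commonly handled by versioning dimension rows rather than overwriting them; the following sketch (with a hypothetical row layout) closes the current version and appends a new one:

from datetime import date

def apply_change(dimension: list, natural_key: str,
                 new_attrs: dict, today: date) -> None:
    """Version a dimension row so history remains queryable over time."""
    for row in dimension:
        if row["key"] == natural_key and row["valid_to"] is None:
            row["valid_to"] = today            # retire the current version
    dimension.append({"key": natural_key, **new_attrs,
                      "valid_from": today, "valid_to": None})

customers = []
apply_change(customers, "C1", {"city": "Austin"}, date(2023, 1, 1))
apply_change(customers, "C1", {"city": "Dallas"}, date(2024, 6, 1))
# Both versions of customer C1 remain, each tagged with its validity period.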
Although adequate time can be devoted initially to constructing the
warehouse, the sheer volume of data in the warehouse generally makes it
impossible to simply reload the warehouse in its entirety later on.
Alternatives include selective (partial) refreshing of data and separate
warehouse versions (requiring double storage capacity for the warehouse!).
When the warehouse uses an incremental data refreshing mechanism, data may need
to be purged periodically; for example, a warehouse that maintains data on the
previous twelve business quarters may purge the oldest quarters' data once a year.
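A purge of this kind can be a simple scheduled deletion; the sketch below assumes a hypothetical quarter_seq column that numbers quarters consecutively:

import sqlite3

def purge_old_quarters(warehouse: sqlite3.Connection,
                       current_quarter_seq: int) -> None:
    """Keep only the twelve most recent quarters of fact data."""
    cutoff = current_quarter_seq - 12
    warehouse.execute("DELETE FROM sales_fact WHERE quarter_seq <= ?",
                      (cutoff,))
    warehouse.commit()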
Data warehouses must also be designed with full consideration of the
environment in which they will reside. Important design considerations include
the following:
· Usage projections
· The fit of the data model
· Characteristics of available sources
· Design of the metadata component
· Modular component design
· Design for manageability and change
· Considerations of distributed and parallel architecture
We discuss each of these in turn. Warehouse design is initially driven
by usage projections; that is, by expectations about who will use the
warehouse and how they will use it. Choice of a data model to support this
usage is a key initial decision. Usage projections and the characteristics of
the warehouse’s data sources are both taken into account. Modular design is a
practical necessity to allow the warehouse to evolve with the organization and
its information environment. Additionally, a well-built data warehouse must be
designed for maintainability, enabling warehouse managers to plan for and
manage change effectively while providing optimal support to users.
You may recall the term metadata
from Chapter 1; metadata was defined as the description of a database including
its schema definition. The metadata
repository is a key data warehouse component. The metadata repository
includes both technical and business metadata. The first, technical metadata, covers details of acquisition processing,
storage structures, data descriptions, warehouse operations and maintenance,
and access support functionality. The second, business metadata, includes the relevant business rules and
organizational details supporting the warehouse.
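A repository entry might pair the two kinds of metadata explicitly; this small sketch records both for one warehouse table, with illustrative field contents:

from dataclasses import dataclass, field

@dataclass
class MetadataEntry:
    """One repository entry, split into the text's two kinds of metadata."""
    name: str
    technical: dict = field(default_factory=dict)  # storage, acquisition, access
    business: dict = field(default_factory=dict)   # business rules, ownership

repository = {
    "sales_fact": MetadataEntry(
        name="sales_fact",
        technical={"storage": "partitioned by quarter", "refresh": "nightly"},
        business={"rule": "amounts are net of returns", "owner": "finance"}),
}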
The architecture of the organization’s distributed computing environment
is a major determining characteristic for the design of the warehouse.
There are two basic distributed architectures: the distributed warehouse
and the federated warehouse. For a distributed
warehouse, all the issues of distributed databases are relevant, for
example, replication, partitioning, communications, and consistency concerns. A
distributed architecture can provide benefits particularly important to
warehouse performance, such as improved load balancing, scalability of
performance, and higher availability. A single replicated metadata repository
would reside at each distribution site. The idea of the federated warehouse is like that of the federated database: a
decentralized confederation of autonomous data warehouses, each with its own
metadata repository. Given the magnitude of the challenge inherent in data
warehouses, it is likely that such federations will consist of smaller-scale
components, such as data marts. Large organizations may choose to federate
data marts rather than build huge data warehouses.