Overview of Data Mining Technology
In reports such as the very popular Gartner Report, data mining has been
hailed as one of the top technologies for the near future. In this section we
relate data mining to the broader area called knowledge discovery and contrast
the two by means of an example.
1. Data Mining versus Data Warehousing
The goal of a data warehouse (see Chapter 29) is to support decision
making with data. Data mining can be used in conjunction with a data warehouse
to help with certain types of decisions. Data mining can be applied to
operational databases with individual transactions. To make data mining more
efficient, the data warehouse should have an aggregated or summarized
collection of data. Data mining helps in extracting meaningful new patterns
that cannot necessarily be found by merely querying or processing data or
metadata in the data warehouse. Therefore, data mining applications should be
strongly considered early, during the design of a data warehouse. Also, data
mining tools should be designed to facilitate their use in conjunction with
data warehouses. In fact, for very large databases running into terabytes and
even petabytes of data, successful use of data mining applications will depend
first on the construction of a data warehouse.
2. Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, frequently abbreviated as KDD, typically
encompasses more than data mining. The knowledge discovery process comprises
six phases: data selection, data cleansing, enrichment, data transformation or
encoding, data mining, and the reporting and display of the discovered
information.
As an example, consider a transaction database maintained by a specialty
consumer goods retailer. Suppose the client data includes a customer name, ZIP
Code, phone number, date of purchase, item code, price, quantity, and total
amount. A variety of new knowledge can be discovered by KDD processing on this
client database. During data selection,
data about specific items or categories of items, or from stores in a specific
region or area of the country, may be selected. The data cleansing process then may correct invalid ZIP Codes or
eliminate records with incorrect phone prefixes. Enrichment typically enhances the data with additional sources of
information. For example, given the client names and phone numbers, the store
may purchase other data about age, income, and credit rating and append them to
each record. Data transformation and
encoding may be done to reduce the amount of data. For instance, item codes may
be grouped in terms of product categories into audio, video, supplies,
electronic gadgets, camera, accessories, and so on. ZIP Codes may be aggregated
into geographic regions, incomes may be divided into ranges, and so on. In
Figure 29.1, we will show a step called cleaning
as a precursor to the data warehouse creation. If data mining is based on an
existing warehouse for this retail store chain, we would expect that the
cleaning has already been applied. It is only after such preprocessing that data mining techniques are used to mine
different rules and patterns.
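The preprocessing phases in the retail example can be sketched in a few lines of
Python. This is a minimal illustration under stated assumptions, not the
procedure of any particular tool: the sample rows, the item-code prefixes, the
purchased demographics table, the income cut-offs, and the use of the first ZIP
digit as a stand-in for a geographic region are all hypothetical. Cleansing here
simply drops rows with a malformed ZIP Code rather than correcting them.

```python
import re

# Made-up sample rows using fields named in the text.
RAW = [
    {"name": "Ann", "zip": "30332", "phone": "404-555-0101",
     "item_code": "CAM-100", "price": 250.0, "quantity": 1},
    {"name": "Bob", "zip": "3O3", "phone": "404-555-0102",
     "item_code": "VID-200", "price": 400.0, "quantity": 1},
    {"name": "Cy", "zip": "94105", "phone": "415-555-0103",
     "item_code": "SUP-300", "price": 12.5, "quantity": 4},
]

# Hypothetical lookup tables (assumptions, not from the text).
CATEGORY_BY_PREFIX = {"CAM": "camera", "VID": "video", "SUP": "supplies"}
PURCHASED_DEMOGRAPHICS = {
    "Ann": {"age": 34, "income": 72000},
    "Cy": {"age": 51, "income": 38000},
}


def select(rows, prefixes):
    """Data selection: keep only rows for the item categories of interest."""
    return [r for r in rows if r["item_code"].split("-")[0] in prefixes]


def cleanse(rows):
    """Data cleansing (simplified): drop rows whose ZIP Code is not five digits."""
    return [r for r in rows if re.fullmatch(r"\d{5}", r["zip"])]


def enrich(rows, extra):
    """Enrichment: append externally obtained attributes to each record."""
    return [{**r, **extra.get(r["name"], {})} for r in rows]


def income_range(income):
    """Encode an income as a coarse range; the cut-offs are arbitrary."""
    if income is None:
        return "unknown"
    if income < 40000:
        return "low"
    if income < 80000:
        return "medium"
    return "high"


def transform(rows):
    """Transformation and encoding: group codes, regions, and incomes."""
    return [
        {
            "name": r["name"],
            "category": CATEGORY_BY_PREFIX[r["item_code"].split("-")[0]],
            "region": r["zip"][0],  # first ZIP digit as a crude region code
            "income_range": income_range(r.get("income")),
            "total": r["price"] * r["quantity"],
        }
        for r in rows
    ]


selected = select(RAW, {"CAM", "VID", "SUP"})
prepared = transform(enrich(cleanse(selected), PURCHASED_DEMOGRAPHICS))
```

Only after a pipeline of this kind has produced `prepared` would a mining
technique be applied to it.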
The result of mining may be to discover the following types of new information:
Association rules. For example, whenever a customer buys video equipment, he
or she also buys another electronic gadget.
Sequential patterns. For example, suppose a customer buys a camera, and within
three months he or she buys photographic supplies; then within six months he or
she is likely to buy an accessory item. This defines a sequential pattern of
transactions. Similarly, a customer who buys more than twice in lean periods
may be likely to buy at least once during the Christmas period.
Classification trees. For example, customers may be classified by frequency of
visits, types of financing used, amount of purchase, or affinity for types of
items; some revealing statistics may be generated for such classes.
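To make the association-rule example concrete, the following sketch measures
how well the rule "video equipment implies electronic gadget" holds in a set of
transactions. Support and confidence are standard measures for such rules but
are not defined in this passage, so their use here is an assumption; the
function name `rule_stats` and the sample baskets are hypothetical.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all baskets containing both item sets
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    with_antecedent = [b for b in baskets if antecedent <= b]
    with_both = [b for b in with_antecedent if consequent <= b]
    support = len(with_both) / len(baskets)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Made-up transactions, each a set of product categories.
baskets = [
    {"video", "gadget"},
    {"video", "gadget", "supplies"},
    {"video"},
    {"camera", "supplies"},
]

support, confidence = rule_stats(baskets, {"video"}, {"gadget"})
```

With these four baskets the rule appears in two of four transactions, and two
of the three video purchases also include a gadget, so the rule holds often but
not "whenever" a customer buys video equipment.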
We can see that many possibilities exist for discovering new knowledge
about buying patterns, relating factors such as age, income group, place of
residence, to what and how much the customers purchase. This information can
then be utilized to plan additional store locations based on demographics, run
store promotions, combine items in advertisements, or plan seasonal marketing
strategies. As this retail store example shows, data mining must be preceded by
significant data preparation before it can yield useful information that can
directly influence business decisions.
The results of data mining may be reported in a variety of formats, such
as listings, graphic outputs, summary tables, or visualizations.
3. Goals of Data Mining and Knowledge Discovery
Data mining is typically carried out with some end goals or
applications. Broadly speaking, these goals fall into the following classes:
prediction, identification, classification, and optimization.
Prediction. Data mining can show how certain attributes within the data will behave in the future. Examples of
predictive data mining include the analysis of buying transactions to predict
what consumers will buy under certain discounts, how much sales volume a store
will generate in a given period, and whether deleting a product line will yield
more profits. In such applications, business logic is coupled with data
mining. In a scientific context, certain seismic wave patterns may predict an
earthquake with high probability.
Identification. Data patterns can be used to identify the existence of an item,
an event, or an activity. For example, intruders trying to break into a system may be
identified by the programs executed, files accessed, and CPU time per session.
In biological applications, existence of a gene may be identified by certain
sequences of nucleotide symbols in the DNA sequence. The area known as authentication is a form of
identification. It ascertains whether a user is indeed a specific user or one
from an authorized class, and involves a comparison of parameters or images or
signals against a database.
Classification. Data mining can partition the
data so that different classes or
categories can be identified based on combinations of parameters. For example,
customers in a supermarket can be categorized into discount-seeking shoppers,
shoppers in a rush, loyal regular shoppers, shoppers attached to name brands,
and infrequent shoppers. This classification may be used in different analyses
of customer buying transactions as a post-mining activity. Sometimes
classification based on common domain knowledge is used as an input to
decompose the mining problem and make it simpler. For instance, health foods,
party foods, or school lunch foods are distinct categories in the supermarket
business. It makes sense to analyze relationships within and across categories
as separate problems. Such categorization may be used to encode the data
appropriately before subjecting it to further data mining.
Optimization. One eventual goal of data mining
may be to optimize the use of
limited resources such as time, space, money, or materials and to maximize
output variables such as sales or profits under a given set of constraints. As
such, this goal of data mining resembles the objective function used in
operations research problems that deal with optimization under constraints.
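The optimization goal can be illustrated with a toy objective function and a
single constraint. The sketch below picks the set of promotions that maximizes
expected extra sales without exceeding a budget. The promotions, their costs,
and their expected gains are invented for illustration, and exhaustive search
is used only because the example is tiny; it is not suggested as a practical
method.

```python
from itertools import combinations

# Hypothetical promotions: (name, cost, expected extra sales).
PROMOS = [("flyer", 3, 8), ("radio", 5, 11), ("coupon", 4, 9), ("display", 2, 4)]


def best_plan(promos, budget):
    """Maximize total expected sales subject to total cost <= budget."""
    best, best_gain = (), 0
    for k in range(1, len(promos) + 1):
        for combo in combinations(promos, k):
            cost = sum(p[1] for p in combo)
            gain = sum(p[2] for p in combo)
            # Keep the combination only if it is feasible and strictly better.
            if cost <= budget and gain > best_gain:
                best, best_gain = combo, gain
    return [p[0] for p in best], best_gain


plan, gain = best_plan(PROMOS, budget=9)
```

Here the objective is the summed expected sales and the constraint is the
budget; in practice the expected gains would themselves come from mining past
transactions.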
The term data mining is popularly used in a very broad sense. In some
situations it includes statistical analysis and constrained optimization as
well as machine learning. There is no sharp line separating data mining from
these disciplines. It is beyond our scope, therefore, to discuss in detail the
entire range of applications that make up this vast body of work. For a
detailed understanding of the topic, readers are referred to specialized books
devoted to data mining.
4. Types of Knowledge Discovered during Data Mining
The term knowledge is broadly
interpreted as involving some degree of intelligence. There is a progression
from raw data to information to knowledge as we go through additional
processing. Knowledge is often classified as inductive versus deductive.
Deductive knowledge deduces new information by applying prespecified logical
rules of deduction to the given data. Data mining addresses inductive knowledge,
which discovers new rules and patterns from the supplied data. Knowledge can be
represented in many forms: in an unstructured sense, it can be represented by
rules or propositional logic.
In a structured form, it may be represented in decision trees, semantic
networks, neural networks, or hierarchies of classes or frames. It is common to
describe the knowledge discovered during data mining as follows:
Association rules. These rules correlate the presence of a set of items with another range of values for another set of variables. Examples: (1) When a female retail shopper buys a handbag, she is likely to buy shoes. (2) An X-ray image containing characteristics a and b is likely to also exhibit characteristic c.
Classification hierarchies. The goal
is to work from an existing set of events
or transactions to create a hierarchy of classes. Examples: (1) A
population may be divided into five ranges of credit worthiness based on a
history of previous credit transactions. (2) A model may be developed for the
factors that determine the desirability of a store location on a 1–10 scale.
(3) Mutual funds may be classified based on performance data using
characteristics such as growth, income, and stability.
Sequential patterns. A
sequence of actions or events is sought. Example: If a patient underwent cardiac bypass surgery for blocked arteries and
an aneurysm and later developed high blood urea within a year of surgery, he or
she is likely to suffer from kidney failure within the next 18 months.
Detection of sequential patterns is equivalent to detecting associations among
events with certain temporal relationships.
Patterns within time series. Similarities can be detected within positions of a time series of data, which is a sequence of data taken at regular intervals, such as daily sales or daily closing stock prices. Examples: (1) Stocks of a utility company, ABC Power, and a financial company, XYZ Securities, showed the same pattern during 2009 in terms of closing stock prices. (2) Two products show the same selling pattern in summer but a different one in winter. (3) A pattern in solar magnetic wind may be used to predict changes in Earth’s atmospheric conditions.
Clustering. A given population of events or
items can be partitioned (segmented) into sets of “similar” elements.
Examples: (1) An entire population of treatment data on a disease may be
divided into groups based on the similarity of side effects produced. (2) The
adult population in the United States may be categorized into five groups from most likely to buy to least likely to buy a new product. (3) The Web accesses made by a collection of
users against a set of documents
(say, in a digital library) may be analyzed in terms of the keywords of
documents to reveal clusters or categories of users.
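As a rough illustration of clustering, the sketch below partitions a small
population into two groups of "similar" elements. It uses k-means, which is one
common clustering technique and is not named in this passage; the customer
data, the choice of two features, and the initial centers are all assumptions
made for the example.

```python
def kmeans(points, centers, iterations=10):
    """Plain k-means on 2-D points with caller-supplied initial centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(len(centers)),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            (sum(x for x, _ in c) / len(c), sum(y for _, y in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return clusters, centers


# Hypothetical customers as (visits per month, average purchase amount).
customers = [(1, 20), (2, 25), (1, 30), (9, 200), (10, 220), (8, 210)]
clusters, centers = kmeans(customers, centers=[(0, 0), (10, 250)])
```

The two resulting groups separate occasional low-spending customers from
frequent high-spending ones. Because the features are on different scales, a
real application would normally rescale them before measuring distance.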
For most applications, the desired knowledge is a combination of the above types.
We expand on each of the above knowledge types in the following sections.
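As a small illustration of the time-series item above, the following sketch
compares the selling patterns of two products season by season. It uses the
Pearson correlation coefficient as the similarity measure, which is an
assumption on our part since the text does not prescribe one; the weekly sales
figures are made up.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up weekly sales for two products in two seasons.
product_a = {"summer": [10, 12, 15, 13], "winter": [5, 6, 8, 9]}
product_b = {"summer": [20, 24, 30, 26], "winter": [9, 8, 6, 5]}

summer_similarity = pearson(product_a["summer"], product_b["summer"])
winter_similarity = pearson(product_a["winter"], product_b["winter"])
```

A value near 1 for summer and a negative value for winter would correspond to
the case of two products that show the same selling pattern in summer but a
different one in winter.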