Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kind of patterns to be found in
data mining tasks. Data mining tasks can be classified into two categories:
descriptive and predictive.
Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in the AllElectronics store, classes of items
for sale include computers and printers, and concepts of customers
include bigSpenders and budgetSpenders. It can be useful to describe individual classes and concepts in
summarized, concise, and yet precise terms. Such descriptions of a
class or a concept are called class/concept descriptions. These descriptions
can be derived via
(1) data characterization, by
summarizing the data of the class under study (often called the target class)
in general terms,
(2) data discrimination, by
comparison of the target class with one or a set of comparative classes (often
called the contrasting classes), or (3) both data characterization and discrimination.
Data characterization is a
summarization of the general characteristics or features of a target class of data. The data corresponding
to the user-specified class are typically collected by a database query. The
output of data characterization can be presented in various forms. Examples
include pie charts, bar charts, curves, multidimensional data cubes, and
multidimensional tables, including crosstabs.
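As a rough illustration of data characterization, the sketch below summarizes a target class by computing value shares for one attribute and a simple crosstab over two attributes. The records, the attribute names (age_group, occupation), and the helper names characterize and crosstab are hypothetical and chosen only for this example; in practice the records would come from a database query.

```python
from collections import Counter

# Hypothetical records for the target class "bigSpenders" (illustrative data only).
big_spenders = [
    {"age_group": "30-39", "occupation": "engineer"},
    {"age_group": "30-39", "occupation": "manager"},
    {"age_group": "40-49", "occupation": "manager"},
    {"age_group": "30-39", "occupation": "engineer"},
]

def characterize(records, attribute):
    """Return the share of records that take each value of `attribute`."""
    counts = Counter(r[attribute] for r in records)
    total = len(records)
    return {value: count / total for value, count in counts.items()}

def crosstab(records, row_attr, col_attr):
    """Count records for each (row value, column value) pair."""
    return Counter((r[row_attr], r[col_attr]) for r in records)

age_summary = characterize(big_spenders, "age_group")
table = crosstab(big_spenders, "age_group", "occupation")

print(age_summary)
print(dict(table))
```

The value shares are the kind of figures a pie or bar chart would display, and the pair counts correspond to the cells of a crosstab.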
Data discrimination is a
comparison of the general features of target class data objects with the general features of objects from one
or a set of contrasting classes. The target and contrasting classes can be
specified by the user, and the corresponding data objects retrieved through database queries.
“How are discrimination descriptions output?” Discrimination
descriptions expressed in rule form are referred to as discriminant rules.
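A minimal sketch of data discrimination, assuming the target and contrasting classes have already been retrieved as lists of records: for each value of an attribute, it reports the share of that value in the target class next to its share in the contrasting class. The sample records, the education attribute, and the function names are hypothetical.

```python
from collections import Counter

def value_shares(records, attribute):
    """Share of records taking each value of `attribute`."""
    counts = Counter(r[attribute] for r in records)
    return {value: count / len(records) for value, count in counts.items()}

def discriminate(target, contrasting, attribute):
    """Map each value to (share in target class, share in contrasting class)."""
    t = value_shares(target, attribute)
    c = value_shares(contrasting, attribute)
    return {v: (t.get(v, 0.0), c.get(v, 0.0)) for v in sorted(set(t) | set(c))}

# Hypothetical data: target class bigSpenders, contrasting class budgetSpenders.
big_spenders = [
    {"education": "university"},
    {"education": "university"},
    {"education": "university"},
    {"education": "high school"},
]
budget_spenders = [
    {"education": "high school"},
    {"education": "high school"},
    {"education": "university"},
    {"education": "high school"},
]

comparison = discriminate(big_spenders, budget_spenders, "education")
for value, (target_share, contrast_share) in comparison.items():
    print(f"{value}: target {target_share:.0%}, contrasting {contrast_share:.0%}")
```

Placing the two shares side by side is one simple comparative measure; a value with a large gap between the classes is a candidate condition for a discriminant rule.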
Frequent Patterns, Associations, and Correlations
Frequent patterns, as the
name suggests, are patterns that occur frequently in data. There are many kinds of frequent patterns,
including itemsets, subsequences, and substructures.
A frequent itemset typically refers to a set of
items that frequently appear together in a
transactional data set, such as computer and software. A frequently
occurring subsequence, such as the pattern that customers tend to purchase first
a PC, followed by a digital camera, and then a memory card, is a (frequent) sequential pattern.
Association analysis. Suppose, as a marketing manager of AllElectronics, you
would like to determine which items
are frequently purchased together within the same transactions. An example of
such a rule, mined from the AllElectronics
transactional database, is
buys(X, "computer") ⇒ buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer.
A confidence, or certainty, of 50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. A 1% support means
that 1% of all of the transactions under analysis showed that computer and
software were purchased together. This association rule involves a single
attribute or predicate (i.e., buys)
that repeats. Association rules that contain a single predicate are referred to
as single-dimensional association rules. Dropping the predicate notation, the
above rule can be written simply as "computer ⇒ software [1%, 50%]".
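The definitions of support and confidence given above can be sketched in a few lines. The transactions below are made-up illustrative data, not the AllElectronics database, and the function names are assumptions for this example. Support is the fraction of transactions containing every item in the rule; confidence is the support of the whole rule divided by the support of its left-hand side.

```python
# Hypothetical transactions; each is the set of items bought together.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"printer"},
    {"software", "memory card"},
]

def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) over the transactions."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base else 0.0

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"computer => software [support = {rule_support:.0%}, "
      f"confidence = {rule_confidence:.0%}]")
```

In this toy data, one of four transactions contains both items (support 25%), and one of the two transactions containing a computer also contains software (confidence 50%).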
Classification is the process of finding a model (or function) that describes and
distinguishes data classes or concepts, for the purpose of being able to use
the model to predict the class of objects whose class label is unknown. The
derived model is based on the analysis of a set of training data (i.e., data
objects whose class label is known).
“How is the derived model
presented?” The derived model may be represented in various forms, such as classification (IF-THEN) rules,
decision trees, mathematical formulae, or neural networks.
A decision tree is a
flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an
outcome of the test, and tree leaves represent classes or class distributions.
Decision trees can easily be converted to classification rules.
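To make the conversion from a decision tree to classification rules concrete, here is a minimal sketch. The tree, its attributes (age, student, credit_rating), and its class labels are hypothetical, and the nested-tuple representation is just one convenient encoding. Each root-to-leaf path becomes one IF-THEN rule.

```python
# Internal node: (attribute, {outcome: subtree}); leaf: a class label string.
# This tree is a hypothetical example, not one learned from data.
tree = ("age", {
    "youth": ("student", {"yes": "buys", "no": "does_not_buy"}),
    "middle_aged": "buys",
    "senior": ("credit_rating", {"fair": "buys", "excellent": "does_not_buy"}),
})

def classify(tree, record):
    """Follow the attribute tests from the root until a leaf is reached."""
    while not isinstance(tree, str):
        attribute, branches = tree
        tree = branches[record[attribute]]
    return tree

def to_rules(tree, conditions=()):
    """Turn every root-to-leaf path into one IF-THEN classification rule."""
    if isinstance(tree, str):
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN class = {tree}"]
    attribute, branches = tree
    rules = []
    for outcome, subtree in branches.items():
        rules.extend(to_rules(subtree, conditions + ((attribute, outcome),)))
    return rules

rules = to_rules(tree)
for rule in rules:
    print(rule)
print(classify(tree, {"age": "youth", "student": "no"}))
```

The tree has five leaves, so the conversion yields five rules, and each rule's conditions are exactly the tests along the path to its leaf.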
A neural network, when
used for classification, is typically a collection of neuron-like processing units with weighted
connections between the units. There are many other methods for constructing
classification models, such as naïve Bayesian
classification, support vector machines, and k-nearest neighbor classification. Whereas classification predicts
categorical (discrete, unordered) labels, prediction models continuous-valued
functions. That is, it is used to predict missing or unavailable numerical data values rather than class labels. Although the term prediction may refer to both numeric prediction and class label prediction, it is used here primarily to refer to numeric prediction.
Classification and prediction analyze class-labeled data objects, whereas clustering analyzes data objects
without consulting a known class label.
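The difference between predicting a categorical label and predicting a numeric value can be illustrated with k-nearest neighbors, one of the methods mentioned above. This is a minimal sketch under stated assumptions: the training data are invented, the function names are hypothetical, and averaging the neighbors' values is only one simple way to do numeric prediction. It requires Python 3.8 or later for math.dist.

```python
import math
from collections import Counter

def nearest(training, query, k):
    """Return the k (features, target) pairs closest to `query` (Euclidean)."""
    return sorted(training, key=lambda pair: math.dist(pair[0], query))[:k]

def knn_classify(training, query, k=3):
    """Classification: majority class label among the k nearest neighbors."""
    labels = [label for _, label in nearest(training, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_predict(training, query, k=3):
    """Numeric prediction: mean target value of the k nearest neighbors."""
    values = [value for _, value in nearest(training, query, k)]
    return sum(values) / len(values)

# Hypothetical class-labeled training data.
labeled = [
    ((1.0, 1.0), "budgetSpender"), ((1.5, 2.0), "budgetSpender"),
    ((2.0, 1.0), "budgetSpender"), ((8.0, 8.0), "bigSpender"),
    ((9.0, 8.5), "bigSpender"), ((8.5, 9.0), "bigSpender"),
]
# Hypothetical training data with a continuous target.
numeric = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

predicted_label = knn_classify(labeled, (8.2, 8.4))
predicted_value = knn_predict(numeric, (2.1,))
print(predicted_label, predicted_value)
```

Both functions rely on training data whose targets are known; a clustering method, by contrast, would group the feature vectors without using the labels at all.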
A database may contain data objects that do not comply with the general behavior
or model of the data. These data objects are outliers. Most data mining methods
discard outliers as noise or exceptions. However, in some applications such as
fraud detection, the rare events can be more interesting than the more
regularly occurring ones. The analysis of outlier data is referred to as outlier mining.
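As one simple illustration of finding objects that deviate from the general behavior of the data, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. This z-score rule is an assumption made for the example, not a method prescribed by the text; the purchase amounts and the threshold of 2.0 are hypothetical.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical purchase amounts; one charge is far larger than the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 500]
outliers = zscore_outliers(amounts)
print(outliers)
```

This rule is crude: an extreme value inflates the mean and standard deviation it is measured against, so small samples need a modest threshold. In an application such as fraud detection, the flagged values would be examined rather than discarded.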
Data evolution analysis describes and models regularities or trends for objects
whose behavior changes over time. Although this may include characterization,
discrimination, association and correlation analysis, classification,
prediction, or clustering of time-related
data, distinct features of such an analysis include time-series data
analysis, sequence or periodicity pattern matching, and similarity-based data analysis.