Approaches to Other Data Mining Problems
1. Discovery of
Sequential Patterns
The
discovery of sequential patterns is based on the concept of a sequence of
item-sets. We assume that transactions such as the supermarket-basket
transactions we discussed previously are ordered by time of purchase. That
ordering yields a sequence of itemsets. For example, {milk, bread, juice},
{bread, eggs}, {cookies, milk, coffee} may be such a sequence of itemsets based on three visits by the same customer to
the store. The support for a
sequence S of itemsets is the
percentage of the given set U of
sequences of which S is a subsequence.
In this example, {milk, bread, juice} {bread, eggs} and {bread, eggs} {cookies,
milk, coffee} are considered subsequences.
The problem of identifying sequential patterns, then, is to find all subsequences from the given sets of
sequences that have a user-defined minimum support. The sequence S1, S2, S3,
... is a predictor of the fact that
a customer who buys itemset S1
is likely to buy itemset S2
and then S3, and so on.
This prediction is based on the frequency (support) of this sequence in the
past. Various algorithms have been investigated for sequence detection.
2. Discovery of
Patterns in Time Series
Time series are sequences of events; each event may be a given fixed type of a
trans-action. For example, the closing price of a stock or a fund is an event
that occurs every weekday for each stock and fund. The sequence of these values
per stock or fund constitutes a time series. For a time series, one may look
for a variety of pat-terns by analyzing sequences and subsequences as we did
above. For example, we might find the period during which the stock rose or
held steady for n days, or we might
find the longest period over which the stock had a fluctuation of no more than
1 percent over the previous closing price, or we might find the quarter during
which the stock had the most percentage gain or percentage loss. Time series
may be compared by establishing measures of similarity to identify companies
whose stocks behave in a similar fashion. Analysis and mining of time series is
an extended functionality of temporal data management (see Chapter 26).
3. Regression
Regression is a special application of the
classification rule. If a classification rule is regarded as a function over the variables that maps these
variables into a target class variable, the rule is called a regression rule. A general application
of regression occurs when, instead of mapping a tuple of data from a relation
to a specific class, the value of a variable is predicted based on that tuple.
For example, consider a relation
LAB_TESTS (patient ID, test 1, test 2, ...,
test n)
which
contains values that are results from a series of n tests for one patient. The target variable that we wish to
predict is P, the probability of
survival of the patient. Then the rule for regression takes the form:
(test 1
in range1) and (test 2 in range2) and ... (test n in rangen) ⇒ P = x,
or x < P ≤ y
The
choice depends on whether we can predict a unique value of P or a range of val-ues for P.
If we regard P as a function:
P = f (test 1, test 2, ..., test n)
the
function is called a regression function
to predict P. In general, if the
function appears as
Y = f (X1, X2, ..., Xn),
and f is linear in the domain variables xi, the process of deriving f from a given set of tuples for <X1, X2, ..., Xn, y> is called linear
regression. Linear regression is a commonly used statistical technique for
fitting a set of observations or points in n
dimensions with the target variable y.
Regression
analysis is a very common tool for analysis of data in many research domains.
The discovery of the function to predict the target variable is equivalent to a
data mining operation.
4. Neural Networks
A neural network is a
technique derived from artificial intelligence research that uses generalized
regression and provides an iterative method to carry it out. Neural networks
use the curve-fitting approach to infer a function from a set of samples. This
technique provides a learning approach;
it is driven by a test sample that is used for the initial inference and
learning. With this kind of learning method, responses to new inputs may be
able to be interpolated from the known samples. This interpolation, however,
depends on the world model (internal representation of the problem domain)
developed by the learning method.
Neural networks can be broadly classified into two categories:
supervised and unsupervised networks. Adaptive methods that attempt to reduce
the output error are supervised learning
methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.
Neural networks self-adapt; that is, they learn from information about a
specific problem. They perform well on classification tasks and are therefore
useful in data mining. Yet, they are not without problems. Although they learn,
they do not provide a good representation of what they have learned. Their outputs are highly quantitative and
not easy to understand. As another limitation, the internal representations
developed by neural networks are not unique. Also, in general, neural networks
have trouble modeling time series data. Despite these shortcomings, they are
popular and frequently used by several commercial vendors.
5. Genetic Algorithms
Genetic algorithms (GAs) are
a class of randomized search procedures capable of adaptive and robust search over a wide range of search spacetopologies. Modeled after the adaptive emergence of biological species from
evolutionary mechanisms, and introduced by Holland, GAs have been
successfully applied in such diverse fields as image analysis, scheduling, and
engineering design.
Genetic
algorithms extend the idea from human genetics of the four-letter alphabet
(based on the A,C,T,G nucleotides) of the human DNA code. The construction of a
genetic algorithm involves devising an alphabet that encodes the solutions to
the decision problem in terms of strings of that alphabet. Strings are
equivalent to individuals. A fitness function defines which solutions can
survive and which cannot. The ways in which solutions can be combined are
patterned after the cross-over operation of cutting and combining strings from
a father and a mother. An initial population of a well-varied population is
provided, and a game of evolution is played in which mutations occur among
strings. They combine to produce a new generation of individuals; the fittest
individuals survive and mutate until a family of successful solutions develops.
The
solutions produced by GAs are distinguished from most other search techniques
by the following characteristics:
·
A GA search uses a set of solutions during each
generation rather than a single solution.
·
The search in the string-space represents a much
larger parallel search in the space of encoded solutions.
·
The memory of the search done is represented solely
by the set of solutions available for a generation.
·
A genetic algorithm is a randomized algorithm since
search mechanisms use probabilistic operators.
·
While progressing from one generation to the next,
a GA finds near-optimal balance between knowledge acquisition and exploitation
by manipulating encoded solutions.
Genetic
algorithms are used for problem solving and clustering problems. Their ability
to solve problems in parallel provides a powerful tool for data mining. The
drawbacks of GAs include the large overproduction of individual solutions, the
random character of the searching process, and the high demand on computer
processing. In general, substantial computing power is required to achieve
anything of significance with genetic algorithms.
Related Topics
Privacy Policy, Terms and Conditions, DMCA Policy and Compliant
Copyright © 2018-2024 BrainKart.com; All Rights Reserved. Developed by Therithal info, Chennai.