Interestingness Of Patterns
A data mining system has the potential to generate thousands or even millions of patterns, or rules. You may then ask, “Are all of the patterns interesting?” Typically not: only a small fraction of the patterns potentially generated would actually be of interest to any given user.
This raises some serious questions for data mining. You may wonder, “What makes a pattern interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns?”
To answer the first question, a pattern is interesting if it is
(1) easily understood by humans,
(2) valid on new or test data with some degree of certainty,
(3) potentially useful, and
(4) novel.
A pattern is also interesting if it validates a hypothesis that the user sought to confirm. An interesting pattern represents knowledge.
Several objective measures of pattern interestingness exist. These are based on the structure of discovered patterns and the statistics underlying them. An objective measure for association rules of the form X ⇒ Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies.
This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y. Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y. More formally, support and confidence are defined as
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
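The two definitions above can be estimated directly from a transaction database by counting. The following is a minimal sketch, assuming transactions are represented as Python frozensets of item names; the function names and the toy data are hypothetical and serve only as illustration.

```python
from typing import FrozenSet, Sequence

Itemset = FrozenSet[str]


def support(transactions: Sequence[Itemset], x: Itemset, y: Itemset) -> float:
    """Estimate P(X ∪ Y): the fraction of transactions containing every item of X and of Y."""
    both = x | y  # union of the two itemsets
    matching = sum(1 for t in transactions if both <= t)
    return matching / len(transactions)


def confidence(transactions: Sequence[Itemset], x: Itemset, y: Itemset) -> float:
    """Estimate P(Y | X): among transactions containing X, the fraction that also contain Y."""
    containing_x = [t for t in transactions if x <= t]
    if not containing_x:
        return 0.0  # assumption: treat confidence as 0 when X never occurs
    return sum(1 for t in containing_x if y <= t) / len(containing_x)


# Hypothetical toy transaction database (not from the text).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
x = frozenset({"bread"})
y = frozenset({"milk"})

print(f"support(bread => milk)    = {support(transactions, x, y):.2f}")     # 0.50
print(f"confidence(bread => milk) = {confidence(transactions, x, y):.2f}")  # 0.67
```

In this toy database, two of the four transactions contain both bread and milk, so the support is 50%. Of the three transactions containing bread, two also contain milk, so the confidence is about 67%.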
In general, each interestingness measure is associated with a threshold, which may be controlled by the user. For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value.
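Threshold-based filtering of this kind can be sketched as follows. The Rule class, the filter_rules function, and the support and confidence values are all hypothetical, chosen only to illustrate how user-controlled thresholds discard rules.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """An association rule antecedent => consequent with precomputed measures."""
    antecedent: str
    consequent: str
    support: float
    confidence: float


def filter_rules(rules: List[Rule], min_support: float = 0.0,
                 min_confidence: float = 0.5) -> List[Rule]:
    """Keep only the rules that meet both user-supplied thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]


# Hypothetical rules with illustrative measure values.
rules = [
    Rule("bread", "milk", support=0.50, confidence=0.67),
    Rule("milk", "butter", support=0.25, confidence=0.33),
    Rule("butter", "bread", support=0.50, confidence=1.00),
]

kept = filter_rules(rules, min_support=0.3, min_confidence=0.5)
for r in kept:
    print(f"{r.antecedent} => {r.consequent} "
          f"(support={r.support:.2f}, confidence={r.confidence:.2f})")
```

Here the rule milk => butter falls below the 50% confidence threshold and is dropped, while the other two rules are kept. Raising or lowering the thresholds changes how many rules survive.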
The second question (“Can a data mining system generate all of the interesting patterns?”) refers to the completeness of a data mining algorithm. It is often unrealistic and inefficient for data mining systems to generate all of the possible patterns. Instead, user-provided constraints and interestingness measures should be used to focus the search.
The third question (“Can a data mining system generate only interesting patterns?”) is an optimization problem in data mining. It is highly desirable for data mining systems to generate only interesting patterns. This would be much more efficient for users and data mining systems, because neither would have to search through the patterns generated in order to identify the truly interesting ones. Progress has been made in this direction; however, such optimization remains a challenging issue in data mining.