Major Issues in Data Mining
This book addresses major issues in data mining regarding mining
methodology, user interaction, performance, and diverse data types. These
issues are introduced below:
1. Mining methodology and user-interaction issues. These reflect the kinds of
knowledge mined, the ability to mine knowledge at
multiple granularities, the use of domain knowledge, ad-hoc mining, and
knowledge visualization.
Mining different kinds of knowledge in databases.
Since different users can be interested in different kinds of knowledge, data mining
should cover a wide spectrum of data analysis and knowledge discovery tasks,
including data characterization, discrimination, association, classification,
clustering, trend and deviation analysis, and similarity analysis. These tasks
may use the same database in different ways and require the development of
numerous data mining techniques.
Interactive mining of knowledge at multiple levels of abstraction.
Since it is difficult to know exactly what can be discovered within a database, the data
mining process should be interactive. For databases containing a huge amount of
data, appropriate sampling techniques can first be applied to facilitate
interactive data exploration. Interactive mining allows users to focus the
search for patterns, providing and refining data mining requests based on
returned results. Specifically, knowledge should be mined by drilling-down,
rolling-up, and pivoting through the data space and knowledge space
interactively, similar to what OLAP can do on data cubes. In this way, the user
can interact with the data mining system to view data and discovered patterns
at multiple granularities and from different angles.
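The roll-up and drill-down operations described above can be sketched in a few lines. This is a minimal illustration, not a description of any particular system: the sales records, the two-level country/city hierarchy, and the function names `aggregate` and `sample` are all assumptions made for the example.

```python
import random
from collections import defaultdict

# Hypothetical sales records: (country, city, amount).
SALES = [
    ("Canada", "Vancouver", 120.0),
    ("Canada", "Toronto", 80.0),
    ("Canada", "Vancouver", 50.0),
    ("USA", "Chicago", 200.0),
    ("USA", "New York", 150.0),
    ("USA", "Chicago", 25.0),
]


def aggregate(records, level):
    """Sum amounts grouped at the given level: 'country' or 'city'."""
    totals = defaultdict(float)
    for country, city, amount in records:
        key = (country,) if level == "country" else (country, city)
        totals[key] += amount
    return dict(totals)


def sample(records, fraction, seed=0):
    """Draw a simple random sample without replacement."""
    rng = random.Random(seed)
    k = max(1, round(len(records) * fraction))
    return rng.sample(records, k)


by_city = aggregate(SALES, "city")        # fine-grained view (drill-down)
by_country = aggregate(SALES, "country")  # coarse view (roll-up)
preview = aggregate(sample(SALES, 0.5), "country")  # quick look on a sample
```

Here `by_country` is the rolled-up view of `by_city`, and `preview` shows how a user might explore a sample first before committing to a pass over the full data.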
Incorporation of background knowledge.
Background knowledge, or information regarding the domain under study, may be used to guide
the discovery process and allow discovered patterns to be expressed in concise
terms and at different levels of abstraction. Domain knowledge related to
databases, such as integrity constraints and deduction rules, can help focus
and speed up a data mining process, or judge the interestingness of discovered
patterns.
Data mining query languages and ad-hoc data mining.
Relational query languages (such as SQL) allow users to pose ad-hoc queries for data
retrieval. In a similar vein, high-level data mining query languages need to be
developed to allow users to describe ad-hoc data mining tasks by facilitating
the specification of the relevant sets of data for analysis, the domain
knowledge, the kinds of knowledge to be mined, and the conditions and interestingness
constraints to be enforced on the discovered patterns. Such a language should
be integrated with a database or data warehouse query language, and optimized
for efficient and flexible data mining.
Presentation and visualization of data mining results.
Discovered knowledge should be expressed in high-level languages, visual representations,
or other expressive forms so that the knowledge can be easily understood and
directly usable by humans. This is especially crucial if the data mining system
is to be interactive. This requires the system to adopt expressive knowledge
representation techniques, such as trees, tables, rules, graphs, charts,
crosstabs, matrices, or curves.
Handling outlier or incomplete data.
The data stored in a database may reflect outliers: noise, exceptional cases, or
incomplete data objects. These objects may confuse the analysis process,
causing overfitting of the data to the knowledge model constructed. As a result,
the accuracy of the discovered patterns can be poor. Data cleaning methods and
data analysis methods that can handle outliers are required. While most
methods discard outlier data, such data may be of interest in itself, such as in
fraud detection for finding unusual usage of telecommunication services or
credit cards. This form of data analysis is known as outlier mining.
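As a rough illustration of outlier mining, the sketch below flags values that lie far from the median, using the median absolute deviation (MAD) as the measure of spread. The text does not prescribe a method; the MAD rule, the conventional 0.6745 scaling constant, the 3.5 threshold, and the usage figures are all assumptions chosen for the example.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median of the absolute deviations from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Simplification: with no measurable spread, flag nothing.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical monthly call minutes for one telephone account.
usage = [52, 48, 50, 51, 49, 50, 47, 53, 500]
suspicious = mad_outliers(usage)
```

With these figures the median is 50 and the MAD is 2, so only the value 500 is flagged. In a fraud-detection setting such a value would be kept for inspection instead of being discarded as noise.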
Pattern evaluation: the interestingness problem.
A data mining system can uncover thousands of patterns. Many of the patterns
discovered may be uninteresting to the given user, representing common knowledge
or lacking novelty. Several challenges remain regarding the development of
techniques to assess the interestingness of discovered patterns, particularly
with regard to subjective measures which estimate the value of patterns with
respect to a given user class, based on user beliefs or expectations. The use
of interestingness measures to guide the discovery process and reduce the
search space is another active area of research.
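To make the idea of an interestingness measure concrete, the sketch below computes three widely used objective measures for an association rule: support, confidence, and lift. These are offered only as examples of objective measures; the subjective, user-specific measures discussed above are not modeled here. The transactions and function names are hypothetical.

```python
# Hypothetical market-basket transactions.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


rule_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)
rule_lift = lift({"bread"}, {"milk"}, TRANSACTIONS)
```

In this toy data the rule bread => milk has a confidence of 3/4, yet its lift is below 1 because milk already appears in 4 of the 5 transactions. A rule can therefore look strong while telling the user nothing new. A minimum-support threshold on `support` is one simple way such a measure can also reduce the search space.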
2. Performance issues. These
include efficiency, scalability, and parallelization of data mining algorithms.
Efficiency and scalability of data mining algorithms.
To effectively extract information from a huge amount of data in databases, data
mining algorithms must be efficient and scalable. That is, the running time of
a data mining algorithm must be predictable and acceptable in large databases.
Algorithms with exponential or even medium-order polynomial complexity will not
be of practical use. From a database perspective on knowledge discovery,
efficiency and scalability are key issues in the implementation of data mining
systems. Many of the issues discussed above under mining methodology and
user-interaction must also consider efficiency and scalability.
Parallel, distributed, and incremental updating algorithms.
The huge size of many databases, the wide distribution of data, and the computational
complexity of some data mining methods are factors motivating the development
of parallel and distributed data mining algorithms. Such algorithms divide the
data into partitions, which are processed in parallel. The results from the
partitions are then merged. Moreover, the high cost of some data mining
processes promotes the need for incremental data mining algorithms which
incorporate database updates without having to mine the entire data again "from
scratch". Such algorithms perform knowledge modification incrementally to
amend and strengthen what was previously discovered.
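The partition, process, and merge pattern, followed by an incremental update, can be sketched as below. Item-frequency counting stands in for a real mining task, and the helper names (`count_items`, `split`, `mine_partitioned`, `apply_update`) are hypothetical. A thread pool is used only to show the structure; in CPython, threads will not speed up CPU-bound counting, and a process pool or a distributed framework would be the likely choice in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, per item, how many transactions in the partition contain it."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def split(transactions, n_parts):
    """Divide the transactions into n_parts interleaved partitions."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def mine_partitioned(transactions, n_parts=2):
    """Process each partition concurrently, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial = list(pool.map(count_items, split(transactions, n_parts)))
    merged = Counter()
    for counts in partial:
        merged.update(counts)
    return merged


def apply_update(counts, new_transactions):
    """Fold new transactions into existing counts without a full re-scan."""
    updated = counts.copy()
    updated.update(count_items(new_transactions))
    return updated


history = [["a", "b"], ["a", "c"], ["b"], ["a", "b", "c"]]
counts = mine_partitioned(history)

new_batch = [["a"], ["c", "d"]]
updated = apply_update(counts, new_batch)
```

The merge step works here because counts are additive across partitions. Many mining results are not additive in this simple way, which is part of what makes parallel and incremental algorithms a research problem. For this example, `updated` equals the result of counting over `history + new_batch` from scratch.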
3. Issues relating to the diversity of database types.
Handling of relational and complex types of data.
There are many kinds of data stored in databases and data warehouses. Since relational
databases and data warehouses are widely used, the development of efficient and
effective data mining systems for such data is important. However, other
databases may contain complex data objects, hypertext and multimedia data,
spatial data, temporal data, or transaction data. It is unrealistic to expect
one system to mine all kinds of data due to the diversity of data types and
different goals of data mining. Specific data mining systems should be
constructed for mining specific kinds of data. Therefore, one may expect to
have different data mining systems for different kinds of data.
Mining information from heterogeneous databases and global information systems.
Local and wide-area computer networks (such as the Internet) connect many sources of
data, forming huge, distributed, and heterogeneous databases. The discovery of
knowledge from different sources of structured, semi-structured, or unstructured
data with diverse data semantics poses great challenges to data mining. Data
mining may help disclose high-level data regularities in multiple heterogeneous
databases that are unlikely to be discovered by simple query systems and may
improve information exchange and interoperability in heterogeneous databases.