Chapter: Data Warehousing and Data Mining : Clustering and Applications and Trends in Data Mining

Important Short Questions and Answers : Clustering and Applications and Trends in Data Mining

What is clustering?

Clustering is the process of grouping the data into classes or clusters so that objects within a cluster have high similarity in comparison to one another, but are very dissimilar to objects in other clusters.

What do you mean by Cluster Analysis?

Cluster analysis is the process of analyzing a set of objects and organizing them into meaningful, descriptive groups (clusters).

What are the requirements of clustering?

Scalability

Ability to deal with different types of attributes

Ability to deal with noisy data

Minimal requirements for domain knowledge to determine input parameters

Constraint based clustering

Interpretability and usability

State the categories of clustering methods.

Partitioning methods

Hierarchical methods

Density based methods

Grid based methods

Model based methods

5. What are the requirements of cluster analysis?

The basic requirements of cluster analysis are:

Dealing with different types of attributes.

Dealing with noisy data.

Constraints on clustering.

Dealing with arbitrary shapes.

High dimensionality

Ordering of input data

Interpretability and usability

Determining input parameters

Scalability

6.What are the different types of data used for cluster analysis?

The different types of data used for cluster analysis are interval-scaled, binary, nominal, ordinal, and ratio-scaled data.

7. What are interval scaled variables?

Interval-scaled variables are continuous measurements on a linear scale. Examples include height and weight, weather temperature, and the coordinates of the objects being clustered. The dissimilarity between objects described by such variables can be computed using a distance measure such as the Euclidean distance or the Minkowski distance.
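The distance measures named above can be sketched in a few lines. This is a minimal illustration, not part of the original notes; the function name `minkowski_distance` and the sample height and weight values are assumptions made for the example. With p = 2 the Minkowski distance reduces to the Euclidean distance, and with p = 1 to the Manhattan distance.

```python
def minkowski_distance(x, y, p=2):
    """Minkowski distance between two equal-length numeric vectors.

    p=1 gives the Manhattan distance, p=2 the Euclidean distance.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Two objects described by hypothetical (height in cm, weight in kg) values.
a = (170.0, 65.0)
b = (173.0, 69.0)

euclidean = minkowski_distance(a, b, p=2)  # sqrt(3^2 + 4^2)
manhattan = minkowski_distance(a, b, p=1)  # 3 + 4
print(euclidean, manhattan)
```

In practice the variables are usually standardized first, so that a variable measured in large units does not dominate the distance.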

8. Define binary variables. What are the two types of binary variables?

A binary variable has only two states, 0 and 1: state 0 means the variable is absent and state 1 means it is present.

There are two types of binary variables, symmetric and asymmetric. A binary variable is symmetric if both of its states are equally valuable and carry the same weight. It is asymmetric if the two states are not equally important, so that a match on one state (usually 1) is more significant than a match on the other.
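The practical effect of the symmetric/asymmetric distinction shows up when dissimilarity is computed. The sketch below is an illustration under assumptions: it uses the common textbook convention (not stated in these notes) that symmetric variables use the simple matching dissimilarity, while asymmetric variables ignore 0-0 matches, as in the Jaccard coefficient. The function name and sample vectors are hypothetical.

```python
def binary_dissimilarity(x, y, symmetric=True):
    """Dissimilarity between two objects described by binary variables.

    q: both 1, r: x=1 and y=0, s: x=0 and y=1, t: both 0.
    Symmetric:  (r + s) / (q + r + s + t)   (simple matching)
    Asymmetric: (r + s) / (q + r + s)       (0-0 matches are ignored)
    """
    pairs = list(zip(x, y))
    q = sum(1 for a, b in pairs if a == 1 and b == 1)
    r = sum(1 for a, b in pairs if a == 1 and b == 0)
    s = sum(1 for a, b in pairs if a == 0 and b == 1)
    t = sum(1 for a, b in pairs if a == 0 and b == 0)
    denominator = q + r + s + (t if symmetric else 0)
    return (r + s) / denominator if denominator else 0.0


# Hypothetical objects with six binary attributes each.
x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 0]

sym = binary_dissimilarity(x, y, symmetric=True)    # 2 / 6
asym = binary_dissimilarity(x, y, symmetric=False)  # 2 / 3
print(sym, asym)
```

The two objects look more dissimilar under the asymmetric measure because the three attributes that are absent in both no longer count as agreement.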

9. Define nominal, ordinal and ratio-scaled variables.

A nominal variable is a generalization of the binary variable in that it can take more than two states. For example, a nominal variable color may have four states: red, green, yellow, and black. The total number of states of a nominal variable is N, and the states can be denoted by letters, symbols, or integers.

An ordinal variable also has more than two states, but these states are ordered in a meaningful sequence.

A ratio-scaled variable makes a positive measurement on a nonlinear scale, such as an exponential scale.

10. What do you mean by partitioning method?

In a partitioning method, a partitioning algorithm arranges the objects into a number of partitions, where the number of partitions does not exceed the number of objects. Each partition represents a cluster. The two main types of partitioning method are k-means and k-medoids.
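As an illustration of a partitioning method, here is a minimal k-means sketch in plain Python. It is a simplified teaching example under stated assumptions, not a production implementation: the initial centroids are supplied by the caller rather than chosen at random, squared Euclidean distance is used, and the function name and sample points are hypothetical.

```python
def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for members, old in zip(clusters, centroids):
            if members:
                new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, initial_centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

k-medoids follows the same assign-and-update pattern but restricts each cluster representative to be an actual object, which makes it less sensitive to outliers than the mean.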

Define CLARA and CLARANS.

CLARA stands for Clustering LARge Applications. Instead of working on the whole data set, CLARA applies k-medoids clustering to representative samples of the data, so its efficiency depends on the sample size. CLARA cannot find the best clustering if none of the selected samples contains the best k medoids. To overcome this drawback, a new algorithm, Clustering Large Applications based upon RANdomized Search (CLARANS), was introduced. CLARANS works like CLARA, except that it does not confine the search to a fixed sample: it draws a random sample of neighboring candidate solutions at each step of the search.

12. What is Hierarchical method?

Hierarchical method groups all the objects into a tree of clusters that are arranged in a hierarchical order. This method works on bottom-up or top-down approaches.

Differentiate Agglomerative and Divisive Hierarchical Clustering.

Agglomerative hierarchical clustering works bottom-up. Each object starts in its own cluster. These single-object clusters are merged into larger clusters, and merging continues until all clusters have been merged into one big cluster that contains all the objects.

Divisive Hierarchical clustering method works on the top-down approach. In this method all the objects are arranged within a big singular cluster and the large cluster is continuously divided into smaller clusters until each cluster has a single object.
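The bottom-up merging described above can be sketched as follows. This is an illustrative example under assumptions: it uses one-dimensional points and single linkage (the distance between two clusters is the distance between their closest members), neither of which is prescribed by these notes, and the function name is hypothetical.

```python
def agglomerative(points):
    """Single-linkage agglomerative clustering on 1-D points.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[p] for p in points]  # every object starts as its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history


history = agglomerative([1.0, 2.0, 6.0, 7.0, 20.0])
merge_distances = [d for _, _, d in history]
for a, b, d in history:
    print(a, "+", b, "at distance", d)
```

The merge history is the tree of clusters: reading it from the last merge back to the first gives the order in which a divisive method would ideally split the data.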

What is CURE?

CURE stands for Clustering Using REpresentatives. Most clustering algorithms favor clusters that are spherical and similar in size. CURE overcomes this limitation and is more robust with respect to outliers.

Define Chameleon method.

Chameleon is another hierarchical clustering method, one that uses dynamic modeling. It was introduced to overcome the drawbacks of the CURE method. In this method, two clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of the objects within each cluster.

Define Density based method.

Density based method deals with arbitrary shaped clusters. In density-based method, clusters are formed on the basis of the region where the density of the objects is high.

What is DBSCAN?

DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise. It is a density-based clustering method that grows regions of sufficiently high density into clusters of arbitrary shape and size. DBSCAN defines a cluster as a maximal set of density-connected points.
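A compact sketch of the DBSCAN idea follows. It is an unoptimized illustration under assumptions, not a reference implementation: the parameter names `eps` (neighborhood radius) and `min_pts` (minimum number of points, including the point itself, for a dense neighborhood) follow common usage but do not appear in these notes, and the sample points are hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster_id += 1  # point i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster_id  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point is labeled as noise rather than being forced into a cluster, which is what distinguishes this approach from partitioning methods.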

What do you mean by Grid Based Method?

In this method, objects are represented by a multiresolution grid data structure. The object space is quantized into a finite number of cells, and the collection of cells forms the grid structure. The clustering operations are performed on that grid structure. This method is widely used because its processing time is very fast: it depends on the number of cells rather than on the number of objects.
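The quantization step can be illustrated with a short sketch. This is a simplified single-resolution example under assumptions: the cell size, the density threshold `min_count`, and the function names are hypothetical choices for illustration and do not come from these notes.

```python
from collections import Counter


def grid_cells(points, cell_size):
    """Quantize 2-D points into grid cells and count the points per cell."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // cell_size), int(y // cell_size))
        counts[cell] += 1
    return counts


def dense_cells(counts, min_count):
    """Cells holding at least min_count points."""
    return {cell for cell, n in counts.items() if n >= min_count}


points = [(0.2, 0.3), (0.7, 0.9), (0.5, 0.1),
          (3.1, 3.4), (3.8, 3.2),
          (7.5, 0.5)]
counts = grid_cells(points, cell_size=1.0)
dense = dense_cells(counts, min_count=2)
print(dict(counts))
print(dense)
```

After the single pass that fills the grid, every later step works on the cell summaries only, which is why the cost of clustering no longer grows with the number of objects.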

What is STING?

STING stands for STatistical INformation Grid. It is a grid-based multiresolution clustering method. In STING, the spatial area is divided into rectangular cells. These cells exist at several levels of resolution, and the levels are arranged in a hierarchical structure.

Define WaveCluster.

It is a grid based multi resolution clustering method. In this method all the objects are represented by a multidimensional grid structure and a wavelet transformation is applied for finding the dense region. Each grid cell contains the information of the group of objects that map into a cell.

What is Model based method?

Model-based methods attempt to optimize the fit between the given data set and a mathematical model. They assume that the data are generated by a mixture of underlying probability distributions. There are two basic approaches:

Statistical Approach

Neural Network Approach.

Name some of the data mining applications.

Data mining for Biomedical and DNA data analysis

Data mining for financial data analysis

Data mining for the Retail industry

Data mining for the Telecommunication industry

Define outlier.

Very often, there exist data objects that do not comply with the general behavior or model of the data. Such data objects, which are grossly different from or inconsistent with the remaining set of data, are called outliers.

What are the types of outlier detection method?

Statistical Distribution-Based Outlier Detection

Distance-Based Outlier Detection

Density-Based Local Outlier Detection

Deviation-Based Outlier Detection

What is Statistical Distribution-Based Outlier Detection?

The statistical distribution-based approach to outlier detection assumes a distribution or probability model for the given data set and then identifies outliers with respect to the model using a discordancy test. Application of the test requires knowledge of the data set parameters (such as the assumed data distribution), knowledge of the distribution parameters (such as the mean and variance), and the expected number of outliers.
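A very simple instance of this idea is sketched below. It is an illustration under assumptions, not a formal discordancy test: the data are assumed to be roughly normally distributed, a value is flagged when it lies more than a chosen number of standard deviations from the mean, and the function name, threshold, and sample values are hypothetical.

```python
import statistics


def zscore_outliers(values, threshold):
    """Flag values whose distance from the mean exceeds threshold standard deviations."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing can be flagged
    return [v for v in values if abs(v - mean) / std > threshold]


values = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
outliers = zscore_outliers(values, threshold=2.5)
print(outliers)
```

Note that the outlier itself inflates the mean and standard deviation, so with small samples a threshold of 3 can fail to flag even an extreme value. This sensitivity to the assumed model and its parameters is a known weakness of the statistical approach.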

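26. What is Density-Based Local Outlier Detection?

Statistical and distance-based outlier detection both depend on the overall or "global" distribution of the given set of data points, D. However, data are usually not uniformly distributed. These methods encounter difficulties when analyzing data with rather different density distributions. Density-based local outlier detection addresses this by judging each object relative to its local neighborhood: an object is a local outlier if it is outlying with respect to the density of its neighborhood.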

What is Deviation-Based Outlier Detection?

Deviation-based outlier detection does not use statistical tests or distance-based measures to identify exceptional objects. Instead, it identifies outliers by examining the main characteristics of objects in a group. Objects that deviate from this description are considered outliers.
