
__TYPE OF DATA IN CLUSTERING ANALYSIS__

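**Data structures**

**Data matrix (two modes): object-by-variable structure.** It stores *n* objects described by *p* variables as an *n* × *p* table whose entry $x_{if}$ is the value of variable *f* for object *i*.

**Dissimilarity matrix (one mode): object-by-object structure.** It stores the dissimilarity *d(i, j)* for every pair of objects *i* and *j* as an *n* × *n* table.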

We describe how object dissimilarity can be computed for objects described by interval-scaled variables; binary variables; nominal, ordinal, and ratio variables; and variables of mixed types.

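__Interval-Scaled Variables__

Interval-scaled variables are continuous measurements on a roughly linear scale. Standardize the data in two steps.

Calculate the mean absolute deviation:

$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right), \qquad m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \cdots + x_{nf}\right)$$

Calculate the standardized measurement (z-score):

$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Using the mean absolute deviation is more robust than using the standard deviation: the deviations from the mean are not squared, so outliers have less influence.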
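The standardization step can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `standardize` is hypothetical, and the code assumes the values of one variable are given as a plain list of numbers that are not all identical.

```python
def standardize(values):
    """Return z-scores for one variable, scaled by the mean absolute deviation."""
    n = len(values)
    m_f = sum(values) / n                         # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n   # mean absolute deviation
    if s_f == 0:
        # Assumption: a constant variable cannot be standardized this way.
        raise ValueError("all values are identical; deviation is zero")
    return [(x - m_f) / s_f for x in values]


z_scores = standardize([2.0, 4.0, 6.0, 8.0])
print(z_scores)
```

Here the mean is 5 and the mean absolute deviation is 2, so each value is expressed as its distance from 5 in units of 2.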

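__Similarity and Dissimilarity Between Objects__

Distances are normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include the *Minkowski distance*:

$$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$

where $i = (x_{i1}, \ldots, x_{ip})$ and $j = (x_{j1}, \ldots, x_{jp})$ are two *p*-dimensional data objects and *q* is a positive integer. With *q* = 1 this is the Manhattan distance; with *q* = 2 it is the Euclidean distance.

Also, one can use weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.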
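As a rough illustration of the Minkowski distance, the sketch below computes it for two objects given as equal-length lists of numbers. The function name `minkowski` and the parameter name `q` (the order of the distance) are choices made for this example only.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


manhattan = minkowski([0, 0], [3, 4], q=1)   # sum of absolute differences
euclidean = minkowski([0, 0], [3, 4], q=2)   # straight-line distance
print(manhattan, euclidean)
```

Setting `q=1` gives the Manhattan distance and `q=2` the Euclidean distance, so one function covers both common cases.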

__Binary Variables__

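A contingency table for binary data counts, over all variables, how often two objects *i* and *j* take each combination of values: *a* is the number of variables that equal 1 for both objects, *b* the number that equal 1 for *i* and 0 for *j*, *c* the number that equal 0 for *i* and 1 for *j*, and *d* the number that equal 0 for both.

Distance measure for symmetric binary variables:

$$d(i, j) = \frac{b + c}{a + b + c + d}$$

Distance measure for asymmetric binary variables:

$$d(i, j) = \frac{b + c}{a + b + c}$$

Jaccard coefficient (*similarity* measure for *asymmetric* binary variables):

$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$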
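The binary measures can be sketched as follows. The counts follow the usual contingency-table convention: `a` counts variables that are 1 for both objects, `b` those that are 1 for the first and 0 for the second, `c` the reverse, and `d` those that are 0 for both. All function names are hypothetical, and the code assumes each object is a non-empty list of 0/1 values.

```python
def binary_counts(x, y):
    """Return (a, b, c, d) for two equal-length lists of 0/1 values."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must be non-empty and of equal length")
    a = b = c = d = 0
    for xi, yi in zip(x, y):
        if xi == 1 and yi == 1:
            a += 1
        elif xi == 1 and yi == 0:
            b += 1
        elif xi == 0 and yi == 1:
            c += 1
        else:
            d += 1          # both 0 (assumes inputs are strictly 0/1)
    return a, b, c, d


def symmetric_binary_distance(x, y):
    a, b, c, d = binary_counts(x, y)
    return (b + c) / (a + b + c + d)


def asymmetric_binary_distance(x, y):
    a, b, c, _ = binary_counts(x, y)   # 0/0 matches are ignored
    if a + b + c == 0:
        raise ValueError("no variable is 1 for either object")
    return (b + c) / (a + b + c)


def jaccard_similarity(x, y):
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        raise ValueError("no variable is 1 for either object")
    return a / (a + b + c)


x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 1, 0, 1, 0]
print(symmetric_binary_distance(x, y), asymmetric_binary_distance(x, y))
```

For asymmetric variables the Jaccard similarity and the asymmetric distance sum to 1, because both ignore the 0/0 matches counted by `d`.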

__Categorical variables__

A categorical (nominal) variable is a generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green.

Method 1: simple matching

$$d(i, j) = \frac{p - m}{p}$$

where *m* is the number of matches and *p* is the total number of variables.

Method 2: use a large number of binary variables, creating a new binary variable for each of the *M* nominal states.
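Both methods can be sketched briefly. The simple matching distance is taken here as the fraction of variables on which two objects disagree, (p - m) / p, with m matches out of p variables. The function names and the example states are illustrative assumptions.

```python
def simple_matching_distance(x, y):
    """Method 1: fraction of nominal variables on which x and y differ."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("objects must be non-empty and of equal length")
    p = len(x)                                    # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)    # number of matches
    return (p - m) / p


def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    if value not in states:
        raise ValueError("unknown state: %r" % (value,))
    return [1 if value == s else 0 for s in states]


print(simple_matching_distance(["red", "small", "round"],
                               ["red", "large", "round"]))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```

After one-hot encoding, the resulting binary variables can be compared with the measures for binary data.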

__Ordinal Variables__

An ordinal variable can be discrete or continuous.

Order is important, e.g., rank.

Ordinal variables can be treated like interval-scaled variables:

replace $x_{if}$ by its rank $r_{if} \in \{1, \ldots, M_f\}$, where $M_f$ is the number of ordered states of variable *f*;

map the range of each variable onto [0, 1] by replacing the rank of the *i*-th object in the *f*-th variable by

$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

compute the dissimilarity using methods for interval-scaled variables.
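The rank-and-rescale steps can be sketched as below. The sketch assumes the ordered list of states is known in advance and uses the common mapping (rank - 1) / (M - 1), where M is the number of states. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank, then map ranks onto [0, 1]."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    # Ranks run from 1 (lowest state) to M (highest state).
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]


scaled = ordinal_to_unit_interval(
    ["bronze", "gold", "silver"],
    ordered_states=["bronze", "silver", "gold"],
)
print(scaled)
```

The rescaled values can then be fed into any distance for interval-scaled variables, and every ordinal variable contributes on the same [0, 1] range regardless of how many states it has.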

__Ratio-Scaled Variables__

A ratio-scaled variable is a positive measurement on a nonlinear scale, approximately at exponential scale, such as $Ae^{Bt}$ or $Ae^{-Bt}$.

Methods:

treat them like interval-scaled variables: *not a good choice!* (Why? The scale can be distorted.)

apply a logarithmic transformation, $y_{if} = \log(x_{if})$, and treat the result as interval-scaled;

treat them as continuous ordinal data and treat their rank as interval-scaled.
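A small sketch shows why the logarithmic transformation helps: values that grow exponentially become evenly spaced after taking logs, so an interval-scaled distance no longer lets the largest values dominate. The function name `log_transform` is hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
import math


def log_transform(values):
    """Apply y = log(x) to strictly positive ratio-scaled measurements."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be strictly positive")
    return [math.log(v) for v in values]


raw = [1.0, math.e, math.e ** 2]   # exponential growth: gaps widen
logged = log_transform(raw)        # after the transform: gaps are equal
print(raw)
print(logged)
```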

__Variables of Mixed Types__

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects.
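One common form of such a weighted formula averages the per-variable dissimilarities and gives weight 0 to any variable that cannot be compared for a given pair of objects:

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} \, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

The sketch below follows that form under stated assumptions: the indicator is 0 when a value is missing (`None`) or when an asymmetric binary variable is 0 for both objects, interval values are scaled by a supplied range, and nominal or binary values contribute 0 on a match and 1 otherwise. The function name, the `kinds` labels, and the `ranges` parameter are all hypothetical.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-variable dissimilarities, each in [0, 1]."""
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                      # weight 0: missing value
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                      # weight 0: uninformative 0/0 match
        if kind == "interval":
            d_f = abs(xf - yf) / ranges[f]    # scale by the variable's range
        else:                             # nominal or binary
            d_f = 0.0 if xf == yf else 1.0
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable variables for this pair")
    return numerator / denominator


kinds = ("interval", "nominal", "asymmetric_binary")
ranges = (50.0, None, None)               # only interval variables need a range
print(mixed_dissimilarity((30.0, "red", 1), (40.0, "blue", 1), kinds, ranges))
```

Ordinal and ratio-scaled variables would first be converted (by rank rescaling or a log transform) and then passed in as `"interval"`.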

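__Vector Objects__

Vector objects: keywords in documents, gene features in microarrays, etc.

Broad applications: information retrieval, biological taxonomy, etc.

Cosine measure:

$$s(X, Y) = \frac{X^{t} \cdot Y}{|X| \, |Y|}$$

where $X^{t}$ is the transpose of vector *X* and $|X|$ is its Euclidean norm.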
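The cosine measure compares the direction of two vectors rather than their magnitude: it is the dot product divided by the product of the Euclidean norms. A minimal sketch, assuming the vectors are equal-length lists of numbers such as keyword counts (the function name is chosen for this example):

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


# Two documents with the same keyword proportions but different lengths.
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Two documents with no keywords in common.
print(cosine_similarity([1, 0], [0, 1]))
```

Because the measure ignores vector length, a long and a short document with the same keyword mix are treated as highly similar.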

