5.12.2. Hierarchical Clustering action


 

Icon: ANATEL~4_img57  

 

Property window:

 

ANATEL~4_img56

 

Short description:

Hierarchical Clustering

 

Long Description:

This Action is mainly for explanatory/teaching purposes. If you want to create a better segmentation, you should use Stardust.

 

A classic algorithm that always works, as long as the data is numerical and not too large.

 

Typically, hierarchical clustering is used in combination with K-Means, to find the optimal number of segments. In general, it is not recommended to trust the segment assignments made by hierarchical clustering.

 

The main limitations of hierarchical clustering are:
 

it needs to start by computing the distance between every pair of points, and hence creates an n x n matrix in memory. This will not work with a “large” database (1,000 records are fine; 10,000 is often a problem; 100,000 will require a large server with a LOT of RAM)

it is slow

cluster assignments are not dynamic: as we regroup, the cluster centers change, but points that were already merged are never reassigned, so some points may become misclassified
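
To get a feel for the memory limitation listed above, the following sketch estimates the size of the distance matrix for a few record counts. It assumes 8 bytes per stored distance and a full n x n matrix; both are illustrative assumptions, and the actual memory footprint of the action may differ. The function name is hypothetical.

```python
def distance_matrix_bytes(n_records, bytes_per_value=8, full_matrix=True):
    """Approximate memory needed to hold pairwise distances for n_records rows."""
    if full_matrix:
        n_values = n_records * n_records               # full n x n matrix
    else:
        n_values = n_records * (n_records - 1) // 2    # lower triangle only
    return n_values * bytes_per_value


for n in (1_000, 10_000, 100_000):
    megabytes = distance_matrix_bytes(n) / 1e6
    print(f"{n:>7} records: about {megabytes:,.0f} MB for the full matrix")
```

Under these assumptions the estimate grows from about 8 MB for 1,000 records to about 800 MB for 10,000 and about 80 GB for 100,000. Every tenfold increase in records costs a hundredfold increase in memory.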

 

While several methods are included in this action, it is best to use Ward.D2. This is the same distance estimation used in other popular statistical software (Stardust, SPSS, etc.), and it tends to give the clearest dendrograms. Compare Ward with the Average Linkage method: the latter tends to create segments of outliers and fails to provide a clear cut in terms of number of segments.

 

ANATEL~4_img58

 

 

To know how many segments to retain, one must “look for a large drop of information”, and then explore each of the potentially “good” solutions.
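
One possible way to make the “large drop” heuristic concrete is to look for the largest jump between successive merge heights on the dendrogram. The sketch below is only one interpretation of that rule, not the action's own logic. The function name and the example heights are made up for illustration, and the heights are assumed to be listed in merge order.

```python
def suggest_cluster_count(merge_heights):
    """Return the cluster count just before the largest jump in merge height.

    merge_heights: heights of the n-1 successive merges, in merge order
    (hypothetical input, e.g. read off the dendrogram's height axis).
    """
    if len(merge_heights) < 2:
        raise ValueError("need at least two merges to compare heights")
    gaps = [b - a for a, b in zip(merge_heights, merge_heights[1:])]
    i = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest jump
    # After merges 0..i there are n - (i + 1) clusters, with n = len(heights) + 1.
    return len(merge_heights) - i


heights = [1.0, 1.2, 1.5, 6.0, 9.0]    # made-up heights for 6 records
print(suggest_cluster_count(heights))  # 3
```

Such a rule only proposes a candidate; the neighbouring solutions are still worth inspecting by eye.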

 
Parameters:
 

Chart Title: title of the plot. The plot will display “Hierarchical Clustering of TITLE with METHOD”, where TITLE is the value entered here and METHOD is the selected method
 

Variables to include in Clusters: select the columns on which to compute the clusters
 

Method: choose one of
 

Ward.D: Ward’s distance, which minimizes the total variance of the clusters. The proximity between two clusters is the amount by which the sum of squares of their joint cluster exceeds the combined sum of squares of the two separate clusters: SS12 − (SS1 + SS2)
 

Ward.D2: squared Ward’s distance (the most common one); it uses the sum of squares instead of the distance
 

Single: Single Linkage follows the logic of “a friend of a friend is a friend”: the distance between two segments is the distance between their two closest points, so points are assigned to the segment that contains the closest point

 

clip0248

 

 

Complete: Complete Linkage follows the logic of “the one I hate the most is my friend”: the distance between two segments is the distance between their two farthest points, so points are assigned to the segment whose farthest point is the closest.

 

ANATEL~4_img61

 

Average: Average Linkage (UPGMA): we take the average distance between all pairs of points of two clusters, which weights each cluster by its number of points. You would expect good segments, but they are often not that clear.

 

ANATEL~4_img63

 

Mcquitty: McQuitty’s method (WPGMA): like Average, but without weighting by the number of points.
 

Median: (WPGMC) similarity based on the median of each cluster (similar to K-Medoid), using the Euclidean distance

 

ANATEL~4_img65

 

Centroid: (UPGMC) distance between the centers of the clusters, using the Euclidean distance.

 

ANATEL~4_img67
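
As an illustration of the methods listed above, the sketch below computes the distance between two small clusters under several of these criteria, including the Ward increase SS12 − (SS1 + SS2). It is a minimal example written for this page, not the action's internal implementation. All function names and the sample points are hypothetical. Mcquitty and Median are left out because they are defined through update rules applied at each merge rather than directly on the points.

```python
import math
from itertools import product


def centroid(cluster):
    """Coordinate-wise mean of a list of points (tuples of floats)."""
    return tuple(sum(coords) / len(cluster) for coords in zip(*cluster))


def sum_of_squares(cluster):
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(cluster)
    return sum(math.dist(p, c) ** 2 for p in cluster)


def linkage_distances(a, b):
    """Distance between clusters a and b under several linkage criteria."""
    pairwise = [math.dist(p, q) for p, q in product(a, b)]
    return {
        "single": min(pairwise),                    # closest pair of points
        "complete": max(pairwise),                  # farthest pair of points
        "average": sum(pairwise) / len(pairwise),   # mean over all pairs
        "centroid": math.dist(centroid(a), centroid(b)),
        # Ward: increase in within-cluster sum of squares caused by the merge
        "ward_increase": sum_of_squares(a + b)
        - (sum_of_squares(a) + sum_of_squares(b)),
    }


a = [(0.0, 0.0), (0.0, 2.0)]   # made-up cluster 1
b = [(4.0, 0.0), (4.0, 2.0)]   # made-up cluster 2
for name, value in linkage_distances(a, b).items():
    print(f"{name}: {value:.3f}")
```

At each step, an agglomerative method merges the pair of clusters for which the chosen criterion is smallest, which is why the same data can produce quite different dendrograms from one method to another.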

 

 

Direction: Select:

 Downwards

 Upwards

 Left

 Right

 
Try all methods and plot chart: run all the methods, so that the best one can be chosen visually
 

Save image as PNG: self-explanatory
 

PNG Directory: specify the directory in which to save the PNG; the file name will be the plot title
 

Row Labels: mandatory. Select the variable that contains the label names; it can be the key.
 

Label Size: label size on the plot
 

Line Type:
 

ANATEL~4_img69

 
 
Dendrogram Type:

 

ANATEL~4_img70 ANATEL~4_img71

 

ANATEL~4_img72 ANATEL~4_img73

 

 

 

ANATEL~4_img74

 

 

 

 

Number of Clusters: set the number of clusters to display in color on the plot (default is 1)
 

Cluster Name: name of the cluster
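
To illustrate what a given number of clusters means on a dendrogram, the sketch below merges the two closest clusters repeatedly and stops when the requested count is reached. It is a naive teaching example that assumes complete linkage and Euclidean distance. It is not the action's implementation, and the function names and sample points are hypothetical.

```python
import math


def complete_linkage(a, b):
    """Distance between the two farthest points of clusters a and b."""
    return max(math.dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters, cluster_distance=complete_linkage):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > n_clusters:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(
            ((i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into i
        del clusters[j]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
for cluster in agglomerate(points, n_clusters=3):
    print(cluster)
```

Once two points have been merged they stay together for all later steps, which is the reason the segment assignments cannot be corrected afterwards.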