5.13.2. Hierarchical Clustering (action)
Icon:
Function: HIERARCHICAL
Property window:
Short description:
Hierarchical Clustering.
Long Description:
This Action is mainly intended for explanatory/teaching purposes. If you want to create a better segmentation, you should use Stardust.
Hierarchical clustering is a classic algorithm that always works, as long as the data is numerical and not too large.
Typically, hierarchical clustering is used in combination with K-Means to find the optimal number of segments. In general, it is not recommended to trust the segment assignments made by hierarchical clustering itself.
The main limitations of Hierarchical clustering are:
•it needs to start by computing the pairwise distances between all points, hence it creates an n x n matrix in memory. This will not work with “large” databases (1,000 records are fine; 10,000 is often a problem and will require a large server with a LOT of RAM; see the memory sketch after this list)
•Curse of dimensionality: if you include many variables, segments will not appear (all points will look roughly equidistant)
•it is slow
•cluster centers are not dynamic: as clusters are merged, their centers move, and points assigned earlier may end up misclassified
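As a rough illustration of the first limitation (plain R, not this Action itself): the lower triangle of the pairwise distance matrix stores n·(n−1)/2 eight-byte doubles, so memory grows quadratically with the number of records:

dist_memory_mb <- function(n) {   # rough size of dist(x) for n records
  n * (n - 1) / 2 * 8 / 1024^2    # number of pairs × 8 bytes, in megabytes
}
dist_memory_mb(1000)     # ~3.8 MB : fine
dist_memory_mb(10000)    # ~381 MB : often a problem
dist_memory_mb(100000)   # ~37 GB  : needs a very large server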
While several methods are included in this Action, it is best to use Ward.D2: this is the same distance estimation used in other popular statistical software (Stardust, SPSS, etc.), and it tends to give the clearest dendrograms. Compare Ward with the Average Linkage method: the latter tends to create segments of outliers and fails to provide a clear cut in terms of number of segments.
To know how many segments to retain, one must “look for a large drop of information” (a big jump in the merge heights of the dendrogram) and explore the various potentially “good” solutions.
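Outside of Anatella, the equivalent computation in plain R (the engine behind this Action) looks roughly like the sketch below; the data frame d and its column content are hypothetical:

# Minimal sketch, assuming d is a purely numeric data frame.
d   <- scale(d)                          # standardize: distances are scale-sensitive
dst <- dist(d, method = "euclidean")     # pairwise distances (kept in memory!)
hc  <- hclust(dst, method = "ward.D2")   # the recommended method
plot(hc, main = "Hierarchical Clustering of Customers with ward.D2")

# "Large drop of information": inspect the heights of the last merges;
# a big jump between consecutive heights suggests a natural cut point.
rev(hc$height)[1:10]

# Try all methods and compare the dendrograms visually (cf. the
# "Try all methods and plot chart" parameter below).
for (m in c("ward.D", "ward.D2", "single", "complete",
            "average", "mcquitty", "median", "centroid")) {
  plot(hclust(dst, method = m), main = m)
}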
Parameters:
Chart Title: title of the plot. The plot will display “Hierarchical Clustering of TITLE with METHOD”.
Variables to include in Clusters: select the columns on which to compute the clusters
Method: choose one of
Ward.D: Ward’s distance, which minimizes the total within-cluster variance. The proximity between two clusters is the amount by which the sum of squares in their joint cluster exceeds the combined sum of squares of the two separate clusters: SS₁₂ − (SS₁ + SS₂). (A small worked example follows this list of methods.)
Ward.D2: Ward’s criterion on squared distances (the most common variant): the sum of squared distances is used instead of the distances themselves.
Single: Single Linkage follows the logic of “a friend of a friend is a friend”: a point is assigned to the segment containing its closest point.
Complete: Complete Linkage follows the logic of “the one I hate the most is my friend”: the distance between two segments is the distance between their two most distant points, so a point joins the segment whose farthest member is nearest.
Average: Average Linkage: the distance between two clusters is the average distance between all their points, weighted by the number of points in each cluster. You would expect good segments from it, but in practice they are often not that clear.
Mcquitty: Average Linkage without the weighting (WPGMA).
Median: (WPGMC) similarity based on the median of each cluster (similar to K-Medoids), using the Euclidean distance.
Centroid: (UPGMC) distance to the center of each cluster, using the Euclidean distance.
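As a tiny worked example of the Ward.D criterion SS₁₂ − (SS₁ + SS₂) above, in plain R with two hypothetical one-dimensional clusters:

ss <- function(x) sum((x - mean(x))^2)   # within-cluster sum of squares
c1 <- c(1, 2, 3)                         # hypothetical cluster 1
c2 <- c(10, 11)                          # hypothetical cluster 2
ss(c(c1, c2)) - (ss(c1) + ss(c2))        # cost of merging them: 86.7
# The larger this increase, the worse the merge; Ward always performs
# the merge with the smallest increase.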
Direction: Select:
Downwards
Upwards
Left
Right
Try all methods and plot chart: runs all the methods so that you can visually choose the best one.
Save image as PNG: self-explanatory.
PNG Directory: the directory in which to save the PNG; the file name will be the plot title.
Row Labels: mandatory; select the variable containing the label names (it can be the key).
Label Size: label size on the plot
Line Type:
Dendrogram Type:
Number of Clusters: set the number of clusters to display in color on the plot (default is 1; see the sketch at the end of this section).
Cluster Name: name of the cluster
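To reproduce the colored clusters outside of this Action, plain R offers cutree() and rect.hclust(); here hc is the fitted object from the earlier sketch and k = 4 is a hypothetical choice for “Number of Clusters”:

k      <- 4                   # hypothetical number of clusters
groups <- cutree(hc, k = k)   # hard assignment of each record to a segment
table(groups)                 # segment sizes
plot(hc, main = "Hierarchical Clustering of Customers with ward.D2")
rect.hclust(hc, k = k, border = 2:(k + 1))   # box each segment on the plot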