5.5.3.1. How to find the right number of segments?

To find the right number of segments for your dataset, look at the dendrogram produced by Ward’s algorithm:

[Figure: dendrogram produced by Ward’s algorithm]

This tree represents the hierarchical aggregation process. You read it from left to right.

For example, you can see here that segments 5 and 8 (which originate from the KMeans algorithm) are “grouped together” into a new segment named ‘A’. Each node of the tree represents an aggregation of two segments, and it also represents the resulting segment. Thus, each node represents one iteration of Ward’s algorithm.

Segments ‘A’ and ‘B’ are grouped together to form the new segment ‘C’.

Segments ‘C’ and ‘11’ are grouped together to form the new segment ‘D’.

Segments ‘D’ and ‘3’ are grouped together to form the new segment ‘E’.
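The aggregation process described above can be sketched in a few lines of Python. This is only an illustration of the general Ward criterion, not StarDust's own code: the function names (`ward_cost`, `ward_merges`), the centroids and the segment sizes are all hypothetical, and the distance scale that StarDust actually plots may differ. At each iteration, the sketch merges the two segments whose fusion increases the within-segment sum of squares the least, and gives the result a new letter.

```python
from itertools import combinations


def ward_cost(a, b):
    """Increase in within-segment sum of squares if segments a and b merge.

    Each segment is a (centroid, size) pair.
    """
    (centroid_a, size_a), (centroid_b, size_b) = a, b
    squared_distance = sum((x - y) ** 2 for x, y in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * squared_distance


def ward_merges(segments):
    """Run the aggregation and return one (left, right, new_name, cost) per iteration.

    segments: dict mapping a segment name to a (centroid, size) pair.
    New segments are named 'A', 'B', 'C', ... (enough for up to 26 merges).
    """
    active = dict(segments)
    merges = []
    while len(active) > 1:
        # Pick the pair of segments that is cheapest to merge.
        left, right = min(
            combinations(active, 2),
            key=lambda pair: ward_cost(active[pair[0]], active[pair[1]]),
        )
        cost = ward_cost(active[left], active[right])
        (centroid_l, size_l), (centroid_r, size_r) = active.pop(left), active.pop(right)
        # The centroid of the new segment is the size-weighted mean.
        centroid = tuple(
            (size_l * x + size_r * y) / (size_l + size_r)
            for x, y in zip(centroid_l, centroid_r)
        )
        new_name = chr(ord("A") + len(merges))
        active[new_name] = (centroid, size_l + size_r)
        merges.append((left, right, new_name, cost))
    return merges


# Hypothetical KMeans segments: name -> (centroid, number of points).
segments = {
    "5": ((0.0, 0.0), 10),
    "8": ((1.0, 0.0), 10),
    "11": ((0.0, 5.0), 10),
    "3": ((20.0, 20.0), 10),
}
merges = ward_merges(segments)
for left, right, new_name, cost in merges:
    print(f"{left} + {right} -> {new_name} (cost {cost:.2f})")
```

With these made-up values, segments 5 and 8 are merged first because they are the closest pair, and segment 3, which lies far away from the others, is merged last.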

The horizontal distance represents the distance between the two segments that are aggregated together: see the illustration in blue. You can see on the dendrogram that the first “aggregations” involve segments that are really close to each other. Only the last three “aggregations” group together segments that are far away from each other. Stopping just before these last three aggregations leaves 4 segments. Thus, from the dendrogram, we deduce that the optimal number of segments for this dataset is “4”: see the illustration in green.
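The visual reasoning above (look for the point where the aggregation distances suddenly jump) can be approximated numerically. The sketch below is a simple "widest gap" heuristic, not a feature documented for StarDust; the function name `suggest_segment_count` and the distance values are hypothetical. It takes the aggregation distances in the order of the iterations and stops just before the biggest jump.

```python
def suggest_segment_count(merge_distances):
    """Suggest a number of segments from the aggregation distances.

    merge_distances: one distance per iteration, in increasing order.
    Needs at least two iterations to have a gap to look at.
    """
    n_initial = len(merge_distances) + 1  # each iteration removes one segment
    gaps = [
        later - earlier
        for earlier, later in zip(merge_distances, merge_distances[1:])
    ]
    widest = max(range(len(gaps)), key=gaps.__getitem__)
    # Keep the iterations up to and including index `widest`, drop the rest.
    return n_initial - (widest + 1)


# Hypothetical distances for 12 initial segments (11 iterations):
# eight cheap aggregations followed by three expensive ones.
distances = [0.4, 0.5, 0.7, 0.8, 1.0, 1.1, 1.3, 1.5, 9.0, 11.0, 14.0]
suggested = suggest_segment_count(distances)
print(suggested)
```

With these made-up distances, the widest gap sits just before the last three aggregations, so the heuristic suggests 4 segments. Treat the result as a starting point only: the dendrogram remains the reference for the final decision.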

You can click anywhere on the dendrogram to select the corresponding final number of segments. For example, in the illustration above, the vertical red line indicates that the user has decided to have “4” segments (because the vertical red line crosses 4 horizontal black lines).
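To make the effect of the vertical line concrete, here is a minimal sketch of how a cut position can be turned into a final segment for every initial segment. It is an assumption about the principle, not a description of StarDust internals: the function name `segments_at_cut`, the distances, and the origin of segment ‘B’ (here built from two hypothetical segments ‘1’ and ‘2’) are invented for the example, and the tree is a reduced version of the one in the illustration. Every aggregation located to the left of the cut is applied; every aggregation to the right of it is ignored.

```python
def segments_at_cut(initial_segments, merges, cut):
    """Map every initial segment to the segment it belongs to at the given cut.

    merges: (left, right, new_name, distance) tuples, one per iteration.
    cut: position of the vertical line on the distance axis.
    """
    parent = {}
    for left, right, new_name, distance in merges:
        if distance <= cut:  # aggregation is on the left of the line: apply it
            parent[left] = new_name
            parent[right] = new_name

    def final_name(name):
        # Follow the applied aggregations up to the last one.
        while name in parent:
            name = parent[name]
        return name

    return {name: final_name(name) for name in initial_segments}


# Reduced, hypothetical tree (distances are made up).
initial = ["5", "8", "1", "2", "11", "3"]
tree = [
    ("5", "8", "A", 0.4),
    ("1", "2", "B", 0.6),   # hypothetical origin of segment 'B'
    ("A", "B", "C", 1.0),
    ("C", "11", "D", 6.0),
    ("D", "3", "E", 9.0),
]
labels = segments_at_cut(initial, tree, cut=2.0)
n_segments = len(set(labels.values()))
print(labels, n_segments)
```

In this reduced example, the line at 2.0 leaves two aggregations on its right, so three segments remain: ‘C’, ‘11’ and ‘3’. In general, the number of final segments is one more than the number of aggregations located to the right of the line.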

When you change the final number of segments inside Ward’s algorithm, you can see, in real time, the colour of the points changing inside the main 3D display of StarDust. Thus, you can instantaneously validate your decision. This is unique to StarDust. Other data mining software forces you to wait for possibly several minutes (sometimes hours!) before being able to validate how many segments you want.