5.5.1. Definition of the Distance to use inside the segmentation


 

A segmentation is based on a “distance-definition”.

 

Using StarDust, you can create very complex distance-definitions. For example, the following axes can be included in your distance-definition:
 
 

On all continuous variables:

o Non-normalized original axes clip0037 of the data: select “ED:none” in the “Normalize” column of the large table on the “Standardization” tab of the STARDU~1_img15 parameter window (this tab is NOT accessible with the STARDU~1_img16 button, only with the STARDU~1_img15 button). See illustration: number1

STARDU~1_img235

 

 

o Normalized axes. There are two types of normalization:
 

Standard Normalization for inclusion inside a Euclidean-distance: STARDU~1_img236

... where STARDU~1_img237 is the ith column of the dataset after normalization

clip0065 is the ith column of the dataset

STARDU~1_img239 is the mean of the ith column of the dataset

STARDU~1_img240 is the standard deviation of the ith column of the dataset

Select “ED: mean centering divided by StdDev” in the same table: number3

 

Quantile Normalization: STARDU~1_img241

Select “ED:quantile [0..1]” in the same table: number2
 

q(x) is an operator that gives as output a number between 0 and 1.
 

The q(x) operator on a column C of the database is computed this way:
 

Sort all the numbers in C in increasing order and remove all the duplicates. This gives a new “sorted” column STARDU~1_img242.
 

STARDU~1_img243

 
 

Thus, q(x) is zero when x is the smallest number in the column C, and q(x) is one when x is the largest number in the column C.
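
The two normalizations can be illustrated with a short Python sketch. It is not StarDust code. The function names are hypothetical, the use of the population standard deviation is an assumption, and the exact q(x) formula exists here only as an image. The sketch assumes q(x) is the position of x in the sorted, duplicate-free column divided by the number of distinct values minus one, which at least matches the stated end points (zero for the smallest value, one for the largest).

```python
from bisect import bisect_left
from statistics import mean, pstdev


def standard_normalize(column):
    """Mean-center a column and divide it by its standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


def quantile_normalize(column):
    """Map each value to a number between 0 and 1 (assumed form of q(x))."""
    distinct = sorted(set(column))  # sort and remove the duplicates
    if len(distinct) == 1:
        return [0.0 for _ in column]
    last = len(distinct) - 1
    # Position of x in the sorted distinct values, scaled to [0, 1].
    return [bisect_left(distinct, x) / last for x in column]


print(quantile_normalize([10, 20, 20, 40]))  # [0.0, 0.5, 0.5, 1.0]
```

Note that, under this assumption, duplicated values share the same q(x) and the spacing between the original values is ignored: only their order matters.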

 

 
The Euclidean-distance between two rows of the dataset (between the two points A and B) is written STARDU~1_img244 and is defined as follows:
 

STARDU~1_img245

 

... where STARDU~1_img246 is a distance based on variables that contain only continuous numbers (values).

STARDU~1_img247 is a distance based on nominal variables only.

E is the set of “active” variables to include inside the Euclidean distance-definition.

V is the set of non-normalized continuous variables.

S is the set of continuous variables that have been normalized using the “Standard” normalization.

Q is the set of continuous variables that have been normalized using the “Quantile” normalization.

N is the set of nominal variables.

P is the set of “active” PCA axes.

eq(x,y) is an operator that returns one if the string ‘x’ equals the string ‘y’ and zero otherwise.

STARDU~1_img248 is the content of the column ‘i’ and row ‘A’ of the dataset.

STARDU~1_img249 and STARDU~1_img250 have been defined on the previous page.

STARDU~1_img251 is the STARDU~1_img252 axis after a “Standard” normalization.

STARDU~1_img253 and clip0068 are user-specified (column-)weights.

 

 
The default values are: the sets V, Q and E are empty, the set S contains all the continuous variables, the set P contains the first ten PCA axes, and the weights STARDU~1_img150 and STARDU~1_img256 are all equal to one.
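
To make the roles of the weights and of the eq(x,y) operator more concrete, here is a minimal Python sketch of a mixed continuous/nominal distance. It is not the StarDust formula, which is only available as an image. The sketch assumes a weighted sum of squared differences on the continuous columns (taken as already normalized where required), plus a weighted count of mismatches on the nominal columns, followed by a square root. The function name `mixed_distance`, the dictionary-based rows and the column names are all hypothetical.

```python
import math


def eq(x, y):
    """Return one if the two strings are equal and zero otherwise."""
    return 1 if x == y else 0


def mixed_distance(row_a, row_b, continuous, nominal, weights=None):
    """Assumed form of a weighted distance between two rows A and B."""
    weights = weights or {}  # missing weights default to one
    total = 0.0
    for col in continuous:
        # Continuous part: weighted squared difference.
        total += weights.get(col, 1.0) * (row_a[col] - row_b[col]) ** 2
    for col in nominal:
        # Nominal part: the weight is added when the two values differ.
        total += weights.get(col, 1.0) * (1 - eq(row_a[col], row_b[col]))
    return math.sqrt(total)


a = {"age": 0.0, "income": 1.0, "city": "Paris"}
b = {"age": 3.0, "income": 1.0, "city": "Rome"}
print(mixed_distance(a, b, ["age", "income"], ["city"], {"city": 7.0}))  # 4.0
```

In this sketch, raising the weight of a column increases its influence on the distance, and therefore on which rows end up in the same segment.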

 

If some variables are included inside a “cosine-distance” (i.e. the set C is non-empty), then the distance-definition between two points A and B is:

 

clip0038

 

...where

clip0039

 

... and where C is the set of variables that are both included inside the “cosine-distance” (column “Normalize”) and “active” (column “In K-Means”).
 

STARDU~1_img257 is the number of variables inside the set
 
STARDU~1_img258 is a user-defined parameter that specifies the relative weight of the cosine-distance compared to the Euclidean-distance.

STARDU~1_img259 is the average of the STARDU~1_img260 and STARDU~1_img261 user-specified weights included inside the Euclidean-distance

STARDU~1_img262 is the mean of the vector composed of the values of the columns STARDU~1_img263 with STARDU~1_img264 on the row A

STARDU~1_img265 is the standard deviation of the vector composed of the values of the columns STARDU~1_img266 with STARDU~1_img267 on the row A

 

 
The default values are: the set C is empty, the weight STARDU~1_img268
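
The per-row mean and standard deviation mentioned above suggest that each row is standardized across the columns of C before the cosine is taken. The following Python sketch illustrates that idea only. It is an assumption, not the StarDust formula, which is only available as an image. In particular, it does not reproduce how the cosine part is weighted and combined with the Euclidean-distance. The function names are hypothetical, and returning 1.0 for a row that is constant over C is an arbitrary convention chosen for this sketch.

```python
import math
from statistics import mean, pstdev


def standardize_row(values):
    """Center a row vector on its mean and divide by its standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def centered_cosine_distance(a_values, b_values):
    """One minus the cosine of the angle between two standardized rows."""
    za = standardize_row(a_values)
    zb = standardize_row(b_values)
    dot = sum(x * y for x, y in zip(za, zb))
    norm_a = math.sqrt(sum(x * x for x in za))
    norm_b = math.sqrt(sum(y * y for y in zb))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # convention of this sketch for a flat row
    return 1.0 - dot / (norm_a * norm_b)


# Two rows with the same "shape" over the columns of C are close to each
# other, even though their absolute levels differ.
print(centered_cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

With this kind of distance, rows are grouped by the profile of their values across the columns of C and not by their magnitude, which is the usual reason to prefer a cosine-distance over a Euclidean-distance for some variables.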

 

You can define the sets E, C here: number1

You can define the sets V, S, Q, C here: number2

You can define clip0040 here: number3

 

 

pca2

 

You can define the sets E, C, P here: number1

You can define STARDU~1_img270 here: number2

 

STARDU~1_img271

 

To help you select the set P of active PCA axes, you can click here: number3

 

When you click the STARDU~1_img272 button, the following window appears:

 

STARDU~1_img273

 

 
Please refer to section 2.2.5 to learn how to select the right number of PCA axes to include inside your “distance-definition”.

 
 

The default, initial setting for the “distance-definition” is equivalent to the following “distance-definition”:
STARDU~1_img274