5.11.2. CART – Step 1 - Create one “deep” tree (clip0243 action)


 

Icon: ANATEL~4_img7  

 

Property window:

 

ANATEL~4_img6

 

Short description:

Create a tree using the CART algorithm

 

Long Description:

The CRT node uses Recursive Partitioning and Regression Trees as implemented in the R rpart library, based on the 1984 book by Breiman, Friedman, Olshen, and Stone. The library is the work of Terry Therneau, Beth Atkinson, and Brian Ripley.

 

A complete introduction to the technique can be found in:

https://cran.r-project.org/web/packages/rpart/vignettes/longintro.pdf

 

This action generates a CART model and sends the following output tables:
 

Output pin 0: full table with predictions

Output pin 1: error table for selection of pruning parameters

Output pin 2: variable importances

 

The goal of this analysis is to generate a large tree and identify where to prune it (see the next action node). While this algorithm gained popularity in the early days of data mining, it has since been falling out of use.

 

The main advantage of CART is its ability to combine nominal and continuous variables to predict (typically) nominal groups. Being non-parametric, it adapts to fairly large datasets, and does not suffer from multicollinearity or non-normality of the data. When it was introduced, it was also highlighted that it yields better predictions than CHAID.  Another key advantage is that it is intuitive, and offers a methodology similar to a drill-down, but with some validation.

 

The key problem with CRT is that, as trees deepen, we are almost certain to overfit the data (some argue that trees deeper than three levels should never be used; others go up to 6 or 7 levels). Indeed, since each prediction is a succession of conditional probabilities, whatever comes out of a deep model becomes unrealistic, which is why many practitioners use combinations of hundreds of small trees (XGBoost, TreeNet, Forests of Stumps, etc.) to overcome this problem. This yields better models, but they are “black boxes”.

 

In practice, the first cut is often based on particularities of the sample, and the following cuts are dependent on the first one, making such models unstable for predictive purposes. They are, however, useful in terms of data exploration and understanding.

 

Parameters:
 

The CRT tree (Classification and Regression Tree) requires the following settings:
 

Predictors: the list of independent variables to use for prediction, in the form x1+x2+…+xk

Target: the variable we want to predict (binary, continuous, or multinomial)

Weight: the tree needs to be “balanced”, so it is a good idea to create a weight variable. If all categories are equal, this variable is a constant.

Job to do: classification or regression

Model Name: name of the model file. This is required to add pruning, or apply the model as is.

Plot Scree: select whether to plot the Scree Plot to identify where overfitting begins

Plot Tree: select whether to plot the tree

Plot Scree and Tree charts in a single window: self-explanatory

Tree display: simplified or fancy. The simplified one is much faster to display, but will not show text categories properly.
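
The settings above map onto the parameters of any standard CART implementation. As a hedged illustration only (the Anatella node itself runs R's rpart, not this code), the same configuration can be sketched with scikit-learn's CART implementation in Python; the dataset and parameter values here are arbitrary examples, not defaults of the node:

```python
# Sketch: mapping the node's settings onto scikit-learn's CART trees.
# This is an analogy to the R rpart call, not the node's internal code.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)        # "Predictors" and "Target"
clf = DecisionTreeClassifier(
    criterion="gini",        # "Job to do": classification (use DecisionTreeRegressor for regression)
    max_depth=10,            # "Maximum depth of the tree"
    min_samples_split=20,    # "Minimum size to allow a split"
    min_samples_leaf=7,      # "Minimum size of a terminal node"
    random_state=42,         # "Seed"
)
clf.fit(X, y)                # a "Weight" variable would be passed as sample_weight here
print("depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```

Note that scikit-learn has no equivalent of rpart's surrogate splits; that part of the node's behavior is specific to the R library.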

 

 

Advanced Parameters:
 

Seed: set a seed value to be able to change the starting point, or reproduce a previous model

Maximum depth of the tree: sets how many successive splits are allowed. Note that very deep trees are unstable (overfitting). Typically, more than 10 levels is not worth exploring.

Minimum size to allow a split: CRT will not split nodes that have fewer observations than this value

Minimum size of a terminal node: when splitting, CRT will attempt to respect this size. If splitting a node would lead to a terminal node smaller than this criterion, the split will not happen

Use surrogates in model: this option controls what happens in case of missing values.

Display Only: an observation with a missing value for the primary split rule is not sent further down the tree.

No Split for all missing: use surrogates, in order, to split observations missing the primary variable; if all surrogates are missing, the observation is not split.

Majority Rule (recommended): if all surrogates are missing, send the observation in the majority direction. This corresponds to the recommendation of Breiman et al. (1984).

Maximum Competing Splits: the number of competitor splits retained in the output. It is useful to know not just which split was chosen, but also which variables came in second, third, etc.

Maximum Surrogate: the number of surrogate splits retained in the output. Setting this to zero reduces compute time, since approximately half of the computational time (other than setup) is spent searching for surrogate splits. However, this yields a model that is harder to put into production.

 

In the log, you will see the details of the analysis. The Scree plot shows the shape of the error reduction (and gives you a pruning criterion):

ANATEL~4_img8

Once the line becomes flat, there is no further improvement in the relative error, so the tree is most likely overfitting by creating additional cuts. The log shows the details of this metric; two “good” stopping criteria could be 18 splits, where the error seems to stabilize, or 47 splits, based on the minimum cross-validated error (xerror):

 

 

        CP       nsplit  rel error  xerror   xstd
 1  0.207568     0       1          1        0.00527
 2  0.191278     1       0.79243    0.79243  0.005003
 3  0.018632     2       0.60115    0.60771  0.004611
 4  0.011306     6       0.52663    0.53087  0.004396
 5  0.009137     7       0.51532    0.52335  0.004373
 6  0.005584     8       0.50618    0.51878  0.004359
 7  0.002423     9       0.5006     0.50392  0.004312
 8  0.00223     11       0.49575    0.50032  0.0043
 9  0.002123    17       0.48237    0.49931  0.004297
10  0.00203     18       0.48025    0.49931  0.004297
11  0.001384    19       0.47822    0.49557  0.004285
12  0.001206    22       0.47407    0.49142  0.004271
13  0.001015    33       0.45828    0.48874  0.004262
14  0.000969    34       0.45727    0.48869  0.004262
15  0.0009      35       0.4563     0.48929  0.004264
16  0.000715    37       0.4545     0.48565  0.004252
17  0.000692    41       0.45159    0.48325  0.004244
18  0.000554    47       0.44712    0.48205  0.00424
19  0.000538    48       0.44656    0.48463  0.004249
20  0.000531    51       0.44495    0.48708  0.004257
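
The logic behind this table, growing a full tree, then ranking candidate pruning levels by cross-validated error, can be sketched in Python with scikit-learn's cost-complexity pruning. This is a hedged analogy only: the node itself uses rpart's CP table, and the dataset below is an arbitrary example:

```python
# Sketch: the CP-table idea (rpart's printcp) redone with scikit-learn's
# cost-complexity pruning path. Not the Anatella node's internal code.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate complexity parameters (analogous to the CP column).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validated error for each pruning level plays the role of "xerror".
results = []
for alpha in path.ccp_alphas:
    clf = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    err = 1 - cross_val_score(clf, X, y, cv=5).mean()
    results.append((alpha, err))

# Pick the pruning level with minimum cross-validated error.
best_alpha, best_err = min(results, key=lambda t: t[1])
print(f"best alpha={best_alpha:.6f}  cv error={best_err:.3f}")
```

The alpha minimizing the cross-validated error corresponds to choosing the nsplit row with the smallest xerror in the table above.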

 

 

A chart with variable importance is also returned:

 

ANATEL~4_img9

You will also find this information in text format in the Anatella log-window.
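
The variable-importance numbers shown in that chart can be read directly from any fitted CART model. As a hedged sketch (an analogy to rpart's variable importance output, not the node's own code), with scikit-learn in Python:

```python
# Sketch: extracting variable importances from a fitted CART model.
# Analogous to the importances the node plots; not the node's internal code.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
clf = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Importances sum to 1; higher means the variable drove more of the splits.
for name, imp in sorted(zip(data.feature_names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:25s} {imp:.3f}")
```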