Weight of Evidence (WoE) is a technique commonly used in risk modeling, mostly for variable selection in logistic regression and sometimes as a model predictor, but it may be time to review its value.

It has an interesting statistical appeal: it captures non-linearity, removes the problem of outliers in the independent variables, and fits naturally with the LN and exponential transformations used in logistic regression. The simple correlations used by default in linear and logistic regression tend to be problematic when relationships are non-linear, and the recodification corrects this. However, it also generates various issues, which we will discuss.

## What is Weight of Evidence

For those unfamiliar with the concept, WoE is a recodification of variables in which we simply “bin” continuous variables (usually into 10 groups), or use the categories of nominal variables, and compute, in each bin, a KPI based on the split between events and non-events.

Basically: compute **SE** = the total number of events (1), and **SNE** = the total number of non-events (0). For each bin *i*, with SE_i events and SNE_i non-events, compute:

WoE_i = LN( (SE_i / SE) / (SNE_i / SNE) )
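As a minimal sketch (with hypothetical bin counts), the per-bin computation looks like this:

```python
import math

def woe_per_bin(bins):
    """WoE_i = ln((SE_i / SE) / (SNE_i / SNE)) for each bin,
    where bins is a list of (events, non_events) counts."""
    se = sum(e for e, _ in bins)    # total events (SE)
    sne = sum(n for _, n in bins)   # total non-events (SNE)
    return [math.log((e / se) / (n / sne)) for e, n in bins]

# Hypothetical counts for three bins
print([round(w, 3) for w in woe_per_bin([(10, 190), (30, 170), (60, 140)])])
# → [-1.335, -0.125, 0.762]
```

A negative WoE marks a bin with fewer events than its overall share would suggest; a positive WoE marks a bin rich in events.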

## What is Information Value?

The information value (IV) sums the per-bin contributions, IV = Σ_i (SE_i/SE - SNE_i/SNE) × WoE_i, and is often used for variable selection, as popularized by Siddiqi (*Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring*, John Wiley and Sons, New Jersey, 2006). He proposes the following cut-offs:

| Information Value | Variable Predictiveness |
|---|---|
| Less than 0.02 | Not useful for prediction |
| 0.02 to 0.1 | Weak predictive power |
| 0.1 to 0.3 | Medium predictive power |
| 0.3 to 0.5 | Strong predictive power |
| > 0.5 | Suspiciously good |

Note that the IV will of course depend on the number of bins (more bins, higher IV). When something is “suspiciously good”, it is often because of target bias, selection error, or leakage from the future (or it can simply be a *very good* variable that IV leads you to mistrust). Let’s take a small example of selection error (a filter should have been applied) with the variable Age in a model predicting “Taxable income amount” on the 1994 US Census. Without any filter, we get the following results with 5 bins (a small number of bins considering the 90.000 rows; we could easily use 10 to 20 without any issues of degrees of freedom). We have:

As we can see, the IV is quite high (0.708), and the categories 2-13 and 14-27 have a low count of “taxable income amount”. This is due to sample bias: the model would essentially predict that children do not declare income above 50.000 USD per year. It does not mean the variable is bad; in this case it means the data is improperly selected.

If we remove all records below 22 years old (the normal age for finishing college), we now have a good variable:
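The dependence of IV on the number of bins is easy to reproduce. Here is a sketch on synthetic data (not the Census file): with nested quantile bins, a finer binning can only increase the empirical IV, so the “suspiciously good” 0.5 threshold can be approached simply by adding bins.

```python
import numpy as np

def information_value(x, y, edges):
    """IV of continuous x for binary y, binned at the given edge values."""
    n_bins = len(edges) - 1
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_bins - 1)
    se, sne = y.sum(), (1 - y).sum()
    iv = 0.0
    for b in range(n_bins):
        e = y[idx == b].sum() / se            # event share of bin b
        ne = (1 - y)[idx == b].sum() / sne    # non-event share of bin b
        if e > 0 and ne > 0:
            iv += (e - ne) * np.log(e / ne)
    return iv

rng = np.random.default_rng(0)
x = rng.normal(size=20_000)
y = (rng.random(20_000) < 1 / (1 + np.exp(-x))).astype(int)

deciles = np.quantile(x, np.linspace(0, 1, 11))
quintiles = deciles[::2]   # every other decile edge, so the bins nest exactly
print(information_value(x, y, quintiles), information_value(x, y, deciles))
```

Because the decile bins refine the quintile bins, the log-sum inequality guarantees the 10-bin IV is at least the 5-bin IV: the “strength” of the variable changes with an arbitrary parameter.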

## Why is it misleading?

I can think of 8 reasons why WoE and IV should be relegated to the pile of interesting statistical methods that we no longer use in data science unless we really have to.

- **Poor selection criterion**: Manually selecting a variable based on a univariate criterion is limiting, to say the least. For example, some variables may have a low univariate contribution but a unique variance, and hence provide stability and strength to a model. IV makes no sense at all if you use a stepwise backward technique.
- **Classification is not the goal**: The goal of a predictive model on an unbalanced dataset (the vast majority of the models we build) is not really classification, it is ranking. When working with rare events, classification is a misleading objective (see this post). As such, the *percentage of non-events* is irrelevant information, and it is easier (and less misleading) to work with the probability of the target.
- **Interpretation is not straightforward**: The univariate variable importance given by the univariate AUC, that is, the strength of a predictive model using only this variable, is more telling. It is a stronger criterion and easier to understand than WoE, since we use the same metric for model creation and variable illustration.
- **Not a stable criterion**: The IV is not very stable: it depends on the number of bins. With Age, for example, we quickly go over the “0.5” limit depending on the number of bins. The cuts are arbitrary and manual, and we can really do better than that: even a simple spline regression is already an improvement!
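The univariate AUC criterion can be sketched in a few lines (synthetic data, plain NumPy): rank candidate variables by the AUC of a one-variable model rather than by IV.

```python
import numpy as np

def univariate_auc(x, y):
    """AUC of variable x used alone to rank a binary target y.
    Mann-Whitney rank-sum identity; assumes (as here) no ties in x."""
    ranks = np.empty(len(x))
    ranks[np.argsort(x)] = np.arange(1, len(x) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    auc = (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return max(auc, 1 - auc)  # direction-agnostic strength

rng = np.random.default_rng(1)
n = 5_000
strong = rng.normal(size=n)
weak = rng.normal(size=n)
p = 1 / (1 + np.exp(-(2 * strong + 0.2 * weak)))
y = (rng.random(n) < p).astype(int)
for name, col in [("strong", strong), ("weak", weak)]:
    print(name, round(univariate_auc(col, y), 3))
```

The metric reads directly as “how well does this variable alone rank the target”, the same scale used to judge the final model.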

The RTT methods used in Timi Modeler use a flexible grouping criterion to create initial linear models, based on the probability of finding targets, a percentage of the dataset, or an absolute number of rows. Bins of nominal variables that yield the same probability of target are automatically regrouped. As we use k-fold cross-validation to create those bins, the risk of an unstable estimator is minimal. Additionally, when working with large datasets, it is common to have many bins, or nominal variables that are very good predictors yet would be rejected by IV.
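As an illustration of the regrouping idea (a simplified sketch, not Timi’s actual RTT algorithm, which also uses k-fold cross-validation): merge nominal categories whose observed event rates are within a tolerance.

```python
def regroup_by_rate(counts, tol=0.05):
    """Merge nominal categories whose event rates differ by less than tol.
    counts: dict category -> (events, non_events). Hypothetical sketch:
    categories are sorted by event rate, then greedily grouped."""
    rate = lambda c: counts[c][0] / sum(counts[c])
    groups = []
    for cat in sorted(counts, key=rate):
        # compare against the first (lowest-rate) member of the last group
        if groups and abs(rate(cat) - rate(groups[-1][0])) < tol:
            groups[-1].append(cat)
        else:
            groups.append([cat])
    return groups

counts = {"A": (10, 90), "B": (11, 89), "C": (30, 70), "D": (29, 71)}
print(regroup_by_rate(counts))  # → [['A', 'B'], ['D', 'C']]
```

Categories A and B (10% and 11% event rate) collapse into one bin, as do C and D (30% and 29%), reducing the degrees of freedom without losing signal.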

**The LN transformation damages the structure of the probabilities**, which is why you will find many recommendations against bins with low target counts. As illustrated below, WoE amplifies the impact of low probabilities: where we see only a small variation of 2.8 in probability between the best and worst groups, we get a 17% variation in WoE, resulting in a variable with an artificially inflated weight in the model. Probability is also much easier to explain than WoE.

- Corollary reason: WoE is **only valid for a logistic regression**, where an LN transformation is performed. Timi Modeler, for example, uses LASSO and ElasticNet regressions with many linear models that do not depend on an exponential transformation; xgBoost, Random Forest and others do not make LN transformations either. It is much easier (and faster) to work with linear transformations (Recoded To Target, or RTT). Applying the WoE transformation would decrease the precision of the final model and take more time.
- The most compelling reason is that on sparse data, WoE and **IV start rejecting the best variables**. This is worth exploring further.
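A small numeric sketch (hypothetical counts; not Timi’s actual RTT grouping, which is more elaborate) makes the amplification concrete: a probability-style recoding keeps a rare bin on the same scale as the others, while the LN in WoE stretches it far out.

```python
import math

# Hypothetical counts per bin: (events, non_events); bin 1 is rare but valid
bins = [(2, 998), (50, 950), (150, 850)]
se = sum(e for e, _ in bins)    # total events
sne = sum(n for _, n in bins)   # total non-events

for e, n in bins:
    prob = e / (e + n)                     # probability-style recoding
    woe = math.log((e / se) / (n / sne))   # WoE recoding
    print(f"p={prob:.3f}  WoE={woe:+.2f}")
```

The event probabilities span a modest 0.002 to 0.150, but the WoE values span roughly -3.58 to +0.89: the low-count bin dominates the recoded variable’s range.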

## We ignore our best predictors!

Let’s compare the results of the Timi model with the IV values of a few variables:

We see that:

- Capital Gains, Education, and Detailed Occupation Code should be rejected (by a large margin)
- Dividend from stock only has half the IV of age (although it has twice the weight in the multivariate model, and Age is only marginally better as a univariate predictor)

A quick glance at the distributions shows nothing wrong with those variables in relation to the target, neither in logic nor in distribution, and it would be a pity to eliminate them! (The red line is the probability of event in each bin.)

## Control Overfitting

What about cases of overfitting where a variable should be rejected and Timi does not reject it, like the case of Age we saw above?

Let’s have a look at the original variable Age:

We immediately see two problems with it:

- The probability of event for Age < 19 is constantly 0. There is a selection bias.
- There are some small “sawtooth” effects that may or may not be explainable. Most notably, there is a drop in probability at age 46, and it goes up again at 51. We have two ways to solve this:
  - Change the bin size and use the “smoothed” model.
  - Change the variable. The pattern in this case is not age, it is a birth cohort. We are dealing with data from 1994, so this pattern corresponds to people born between 1943 and 1948: when the US entered the European front of WWII. This affected the employment conditions of our cohort, and we should create a new variable for it.

This information is much easier to understand than WoE and IV. It leads to the same conclusion, faster (we see immediately that the cut should be at 19, instead of the arbitrary 22 I used above), and it helps generate additional questions and constructs that would not have been apparent without this graphical visualization. One may argue that a proper “Data Understanding” phase would have found the same, but such analysis is rarely done: typical histograms group more bins, because the grouping does not take the target into consideration, and a bivariate analysis is not commonly done either.

## Conclusions

In conclusion: spend 10% of the time you would have spent constructing your WoE on analyzing the output of Timi: you will get a better model and generate more interesting questions. And never trust a model at face value; a proper understanding of all the relationships is important.

When should you still use it? If you are using logistic regressions with a relatively small dataset, a fairly balanced target, and normally distributed independent variables, WoE is still an acceptable way to go, and IV can help get your model accepted by regulatory authorities who just won’t listen. Still, use caution: don’t put too much faith in the 0.5 and 0.02 thresholds, as they become irrelevant if you do a stepwise selection, and they should never be a “rule” (which they were never intended to be).

When using a regression with “large” datasets (more than 40-50.000 rows and up into the millions) and an unbalanced target (usually less than 10% of events), the Recoding To Target used in Timi Modeler is a much better choice than WoE.