# The Instability of Forward and Backward Selection

29 May 2016Classical statistics often assumes that the analyst knows which variables are important and which variables are not. Of course, this is a strong assumption, and therefore many variable selection procedures have been developed to address this problem. In this blog post, I want to focus on two subset selection methods, and I want to address their instability. In other words, I want to discuss how **small changes** in the data can lead to **completely different solutions**.

For the sake of clarity, let’s focus on a simple linear regression model with \( p \) covariates:

\[E(Y \mid X_1, \ldots, X_p) = \beta_0 + \beta_1 X_1 + \ldots + \beta_p X_p.\]As the analyst, we were given these \( p \) variables to analyse, but we don’t necessarily know which ones are relevant in this model. Subset selection methods look at all possible \( 2^p - 1 \) models you can get from selecting a subset of these \( p \) variables and tries to find the most relevant model for the analysis. Graphically, we can arrange all these models in a lattice: the null model (containing only the intercept term \( \beta_0 \)) is at the bottom, the full model (containing all \( p \) variables) is at the top, and two nodes corresponding to two models are connected if they have all but one variable in common. For example, the following models are connected:

\[E(Y \mid X_1) = \beta_0 + \beta_1 X_1, \qquad E(Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2;\]but the following models are **not** connected:

By “subset selection” methods, I mean methods that search this lattice in a discrete fashion, i.e. they compare some of these models using a criterion.

## Information criteria

One such method is called the *all-subset selection*. This method is typically based on information criteria (but other criteria can be used). The idea is simple: for **all** possible models, compute a criterion and select the model that optimises this criterion. As you can imagine, this method becomes quickly impractical: with only 10 variables, we already have 1023 different possible models; with 20 variables, we have over a million possible models. For this reason, all-subset selection is not very popular.

## Forward and Backward selection

Instead I will focus on Forward and Backward selection, which are very popular approaches to model selection. Their distinguishing feature is that they are procedure that move through the lattice of all possible models until no “good” move is left. I will illustrate these methods using a well-known dataset on prostate cancer:

Looking at the documentation for this dataset:

```
These data come from a study that examined the correlation between the level of prostate specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. It is data frame with 97 rows and 9 columns.
```

We will look at the mean cancer volume (on the log scale) as a function of the other 8 clinical features (so there are 255 possible models).

As we can see, the prostate specific antigen (PSA) levels and capsular penetration (both on the log scale) are the most significant variables. In forward selection, we start with the null model (only the intercept) and we look at all available variables. Adding them one at a time, we decide which one is the most relevant to the model, and consider it part of our model. We then look at the remaining terms and decide if we can improve our model by adding one more variable. We stop when including another variable does not improve our model anymore.

We will use this approach with the Prostate dataset:

The most important variable in the dirst stage was, no surprise, the PSA level. But we also ended up adding three mode variables: capsular penetration, age, and benign prostatic hyperplasia amount (also on the log scale). Therefore, this is the model selected using Forward selection.

Backward selection is very similar, but we start with the full model and decide which variable is the *least* relevant to the model. We then continue removing variables until doing so decreases significantly the quality of our model. Using this approach with the Prostate dataset:

The least relevant variable is the weight of the prostate. But we only ended up removing one more variable (seminal vesicle invasion), and therefore we can see that forward and backward selection **do not** lead to the same model. Which in itself can be a problem: which one should we choose?

## Perturbation and Instability

In a 1996 Annals of Statistics paper, Leo Breiman described several undesirable properties that subset selection methods have. I will focus on only one of them: **instability**. By instability, I mean that small changes in the data can lead to the selection of a different model.

We will investigate this phenomenon through simulations. I will randomly select one observation and change its value for the response variable. I repeat this process 500 times, and I look at the percentage of time each variable was selected in the model. So I am only changing **one number** in the whole dataset.

First, the good news: PSA levels and capsular penetration are **always** selected by forward selection. However, age and benign prostatic hyperplasia amount are selected only about 50% of the time. And remember that we’re only changing one number!

We can also look at backward selection:

Here the problem is even worse: age is selected 80% of the time, benign prostatic hyperplasia amount is selected two times out of three, and both gleason score variables are selected only half the time.

## Penalization methods

This instability issue is common to every subset selection methods (stepwise selection is another such method). This is essentially a consequence of their *discrete* nature.

Another approach to model selection is based on regularization/penalization procedures, such as lasso and elastic net. These approaches search through the lattice of all possible models using one or two **continuous** parameters. As such, they are typically less sensitive to perturbations of the data. This was already pointed out by Breiman in his discussion of ridge regression (which is *not* a model selection method, but it is a regularization procedure).