Sometimes the best model is no model

Data can be used to discover relationships between predictands and an assumed set of predictors, even in the absence of any prior theory. However, the lack of an underlying theory imposes a cost: the model-building procedure is subject to the bias-variance trade-off. Namely, statistical models which fit the observed data very well (low bias) tend to perform badly on new data (high variance).

Predictive model selection is the art of finding the right compromise between over-fitting and under-fitting. Often practitioners rely on simulations to help guide heuristics and intuition for a particular problem. The following is a simple example.[*]

Suppose there are  n observations  \left\{ y \right\} and  n sets of  p potential predictors  \left\{x_1, \ldots, x_p \right\}. The predictors are independently distributed  \mathcal{N}(0,1). Observations are generated by a linear process of the type

(A)   \[ y = x_1 + \cdots + x_q + \sigma \, \mathcal{N}(0,1) \]

where  p > q. In other words, only  q of the  p potential predictors actually play a role in generating the observations. As the noise level  \sigma in model (A) increases (i.e. as the signal-to-noise ratio decreases), it becomes harder for multivariate linear regression to pick out the "unknown" subset of  q predictors and discover a useful predictive model.
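As a rough illustration, here is a minimal R sketch of the data-generating process (A). The variable names (n, p, q, sigma) and the seed are illustrative choices, not taken from the post's own code.

```r
## Simulate one dataset from process (A): p independent N(0,1) predictors,
## of which only the first q enter the response, plus N(0, sigma^2) noise.
set.seed(1)                                   # illustrative seed
n <- 100; p <- 50; q <- 10; sigma <- 1        # values used in the experiment below

X <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- rowSums(X[, 1:q]) + sigma * rnorm(n)     # only x_1, ..., x_q carry signal
```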

Take  n=100, p=50 and  q=10. 400 independent observation datasets were generated using (A), fitted, and the results averaged. The leaps package in R was used to select best-subset regression fits for each number of regressors from 0 to 50 (using "forward" selection). An average predictive  R^2 was also calculated from 10000 new observations for each fitted model. While the fit  R^2 is always positive, the predictive " R^2" can have either sign; a negative predictive  R^2 means that the model is "toxic".
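For concreteness, the selection-and-validation step might look like the following sketch (reusing X, y, n, p, q and sigma from above). It uses leaps::regsubsets with forward selection as in the post, but the fresh-sample size and helper names are illustrative, and it only covers models with 1 to p regressors.

```r
## Forward selection over 1..p regressors with leaps, then predictive R^2
## on a large fresh sample drawn from the same process (A).
library(leaps)

fit <- regsubsets(X, y, nvmax = p, method = "forward")

n_new <- 10000
X_new <- matrix(rnorm(n_new * p), n_new, p, dimnames = list(NULL, colnames(X)))
y_new <- rowSums(X_new[, 1:q]) + sigma * rnorm(n_new)

pred_r2 <- sapply(1:p, function(k) {
  b    <- coef(fit, id = k)                    # intercept + k selected coefficients
  vars <- setdiff(names(b), "(Intercept)")
  yhat <- b["(Intercept)"] + X_new[, vars, drop = FALSE] %*% b[vars]
  1 - sum((y_new - yhat)^2) / sum((y_new - mean(y_new))^2)  # can go negative ("toxic")
})
```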

[Figure: average fit and predictive  R^2 versus number of regressors, for several values of  \sigma]

Complexity (the number of regressors) increases from left to right in the above graph. As expected, there is a trade-off between model complexity and predictive power. For  \sigma = 1 the optimal number of regressors is  q=10, and additional regressors (excess complexity) degrade the predictive power somewhat, but not disastrously. However, as  \sigma grows, the loss of predictive power with excess complexity increases. Eventually, above  \sigma \simeq 4.5, even the simplest models are toxic.

The graphs below show simulated fit accuracy and predictive power versus  \sigma for various model selection methods (lm = ordinary multivariate regression with no penalties, best 6 = the optimal subset of 6 predictors found by exhaustive search, etc.). The 10-fold cross-validated ridge regression and lasso calculations were done using glmnet ( \alpha=0 and 1 respectively).
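A minimal glmnet sketch of the ridge and lasso fits, again reusing the simulated X, y, X_new and y_new from the sketches above; reporting at lambda.min is an assumption on my part, not necessarily the choice made in the post.

```r
## 10-fold cross-validated ridge (alpha = 0) and lasso (alpha = 1) fits,
## evaluated by predictive R^2 on the fresh sample.
library(glmnet)

ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)
lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)

r2 <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
c(ridge = r2(y_new, predict(ridge, newx = X_new, s = "lambda.min")),
  lasso = r2(y_new, predict(lasso, newx = X_new, s = "lambda.min")))
```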

[Figure: fit and predictive  R^2 versus  \sigma for the various model selection methods]

Akaike and Bayesian information criteria do not perform well, leading to toxic models above  \sigma \approx 4. This is not too surprising, because the theorems which justify these criteria are invalid when  p \sim n. It is striking that penalised regression plus cross-validation suppresses over-fitting correctly around  \sigma \sim 4. The lasso (which can force regression coefficients to be exactly zero) gave the best performance overall; it handles the fact that the best model can be no model at all.
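One way to see the "no model" behaviour directly, purely as an illustrative check on the lasso fit sketched above, is to count how many coefficients it leaves non-zero; at large sigma this count can drop to zero, i.e. an intercept-only model.

```r
## Number of non-zero lasso coefficients (excluding the intercept);
## zero means the lasso has effectively selected "no model".
b <- as.numeric(coef(lasso, s = "lambda.min"))[-1]
sum(b != 0)
```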

The R code which does the model selection comparison is here.

[*] The simulations are related to a crop yield prediction problem.
