Data can be used to discover relationships between predictands and an assumed set of predictors, even in the absence of any prior theory. However, the lack of an underlying theory imposes a cost: the model-building procedure is subject to the bias-variance trade-off. Namely, statistical models which fit the observed data very well (low bias) tend to perform badly on new data (high variance).
Predictive model selection is the art of finding the right compromise between over-fitting and under-fitting. Often practitioners rely on simulations to help guide heuristics and intuition for a particular problem. The following is a simple example.[*]
Suppose there are $n$ observations and $p$ sets of potential predictors $x_1, \dots, x_p$. The predictors are independently distributed. Observations are generated by a linear process of the type

$$ y = \sum_{j=1}^{k} \beta_j x_j + \varepsilon \qquad \text{(A)} $$

where $\varepsilon \sim N(0, \sigma^2)$. In other words, only $k$ of the $p$ potential predictors actually play a role in generating the observations. As the noise level $\sigma$ of the model (A) increases (i.e. as the signal-to-noise ratio falls), it becomes harder for multivariate linear regression to pick out the "unknown" subset of $k$ predictors and discover a useful predictive model.
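For concreteness, here is a minimal R sketch of the set-up (A). The values of n, p, k, beta and sigma below are illustrative placeholders, not the values used in the original simulation.

```r
# Sketch of the data-generating process (A); all numeric choices are illustrative.
set.seed(1)
n <- 100; p <- 50; k <- 6           # observations, candidate predictors, true predictors
sigma <- 1                          # noise standard deviation
beta <- c(rep(1, k), rep(0, p - k)) # only the first k predictors carry signal
X <- matrix(rnorm(n * p), n, p)     # independently distributed predictors
colnames(X) <- paste0("x", 1:p)
y <- drop(X %*% beta + rnorm(n, sd = sigma))
```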
Take fixed values of $n$ and $p$. 400 independent observation datasets were generated using (A), fitted, and averaged over. The leaps package in R was used to select best subset regression fits for each of 0 to 50 regressors (using "forward" selection). An average predictive $R^2$ was also calculated based on 10000 new observations for each fitted model. While the fit $R^2$ is always positive, the predictive "$R^2$" can have either sign. A negative predictive $R^2$ means that the model is "toxic": it predicts new data worse than simply using the mean of the observations.
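A sketch of how the fit and predictive $R^2$ can be computed with leaps is shown below, reusing the X, y, beta and sigma from the sketch above. The forward search and the 10000-observation test set follow the description in the text, but the code is illustrative rather than the original script.

```r
# Forward best-subset search and predictive R^2 on fresh data from process (A).
library(leaps)
fit <- regsubsets(X, y, method = "forward", nvmax = p)

n_new <- 10000
X_new <- matrix(rnorm(n_new * p), n_new, p, dimnames = list(NULL, colnames(X)))
y_new <- drop(X_new %*% beta + rnorm(n_new, sd = sigma))

# Predictive R^2 for the best model of each size; it can be negative ("toxic").
pred_r2 <- sapply(1:p, function(m) {
  cf   <- coef(fit, id = m)   # intercept plus the m selected coefficients
  yhat <- cf[1] + X_new[, names(cf)[-1], drop = FALSE] %*% cf[-1]
  1 - sum((y_new - yhat)^2) / sum((y_new - mean(y_new))^2)
})
```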
Complexity (number of regressors) increases from left to right in the above graph. As expected, there is a trade-off between model complexity and predictive power. At low noise the optimal number of regressors is close to the true number $k$, and additional regressors (excess complexity) degrade the predictive power somewhat, but not disastrously. However, as $\sigma$ grows the loss of predictive power with excess complexity increases. Eventually, above a certain noise level, even the simplest models are toxic.
The graphs below show simulated fit accuracy and predictive power versus $\sigma$ for various model selection methods (lm = ordinary multivariate regression with no penalties, best 6 = optimal subset of 6 predictors using exhaustive search, etc.). 10-fold cross-validated ridge regression and lasso calculations were done using glmnet ($\alpha = 0$ and $\alpha = 1$ respectively).
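A sketch of the penalised fits with glmnet, again reusing the simulated X and y from above; in glmnet, alpha = 0 gives ridge regression and alpha = 1 gives the lasso, and nfolds = 10 gives the 10-fold cross-validation mentioned in the text.

```r
# Cross-validated ridge and lasso fits on the simulated data.
library(glmnet)
ridge <- cv.glmnet(X, y, alpha = 0, nfolds = 10)  # ridge: pure L2 penalty
lasso <- cv.glmnet(X, y, alpha = 1, nfolds = 10)  # lasso: pure L1 penalty

# Lasso coefficients at the cross-validation-selected penalty; many are
# exactly zero, so whole predictors can be dropped from the model.
coef(lasso, s = "lambda.min")
```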
Akaike and Bayesian information criteria do not perform well, leading to toxic models at higher noise levels. This is not too surprising, because the theorems which justify these criteria do not apply when the number of candidate predictors is comparable to the number of observations. It is striking that penalised regression plus cross-validation suppresses over-fitting correctly even in the high-noise regime. The lasso (which is capable of forcing regression coefficients to be exactly zero) gave the best performance overall. Lasso regression handles the fact that the best model can be no model at all.
The R code which does the model selection comparison is here.
[*] the simulations are related to a crop yield prediction problem.