Sometimes the best model is no model

Data can be used to discover relationships between predictands and an assumed set of predictors, even in the absence of any prior theory. However lack of an underlying theory imposes a cost: the model-building procedure is subject to bias-variance. Namely, statistical models which fit the observed data very well (low bias) tend to perform badly on new data (high variance).

Predictive model selection is the art of finding the right compromise between over-fitting and under-fitting. Often practitioners rely on simulations to help guide heuristics and intuition for a particular problem. The following is a simple example.[*]

Suppose there are  n observations  \left\{ y \right\} and  n sets of potential predictors  \left\{x_1 .. x_p \right\}. The predictors are independently distributed  \mathcal{N}(0,1). Observations are generated by a linear process of the type

(A)   \[ {y = x_1 + .. +x_q + \sigma\mathcal{N}(0,1)} \]

where  p > q. In other words, only  q of  p potential predictors actually play a role in generating the observations. As the signal to noise ratio  \sigma of the model (A) increases, it becomes harder for multi-variate linear regression to pick out the “unknown” subset of  q predictors and discover a useful predictive model.

Take  n=100, p=50 and q=10. 400 independent observation datasets were generated using (A), fit and averaged over. The leaps package in R was used to select best subset regression fits for each of 0 to 50 regressors (using “forward” selection). An average predictive  R^2 was also calculated based on 10000 new observations for each fitted model. While the fit  R^2 is always positive, predictive “ R^2” can have either sign. Negative predictive  R^2 means that the model is “toxic”.

 

Complexity (number of regressors) increases from left to right in the above graph. As expected, there is a trade-off between model complexity and predictive power. For  \sigma =1 the optimal number of regressors is  q=10 and additional regressors (excess complexity) degrades the predictive power somewhat, but not disastrously. However as  \sigma grows the loss of predictive power with excess complexity increases. Eventually, above  \sigma \simeq 4.5 even the simplest models are toxic.

The graphs below shows simulated fit accuracy and predictive power versus  \sigma using various model selection methods. (lm=ordinary multivariate regression with no penalties, best 6= optimal subset of 6 predictors using exhaustive search etc.) 10-fold cross-validated ridge regression and lasso calculations were done using glmnet ( \alpha=0, 1 respectively).


Akaike and Bayesian information criteria do not perform well, leading to toxic models above  \sigma \approx 4. This is not too surprising because the theorems which justify these criteria are invalid when  p \sim n. It is striking that penalised regression + cross-validation suppress over-fitting correctly around  \sigma \sim 4. The lasso (which is capable of forcing regression coefficients to be strictly zero) gave the best performance overall. Lasso regression handles the fact that the best model can be no model at all.

The R code which does the model selection comparison is here.

[*] the simulations are related to a crop yield prediction problem.

June 13, 2013 · joe · No Comments
Posted in: R

20th Century Droughts

The Global Precipitation Climatology Project (GPCP)[1] of Deutscher Wetterdienst created a global dataset of monthly rainfall covering the years 1901-2011 at 0.5o resolution. It is based on rain gauge observations and was first published in 2011.

Rainfall patterns and variability vary greatly from one location to another. Moreover, the human eye is easily fooled by randomness in time-series data.  A convenient way to handle these issues is to replace the observed (highly non-normal and seasonally varying) rainfall amounts by a standardized index such that an index  value of say, -2, represents an extremely dry period (-2 standard deviations from the mean implies that rainfall is lower than the observed value only 2.1% of the time). Another advantage of using a Standardized Precipitation Index (SPI) is that a separate index can be defined with respect to each cumulative rainfall time-scale of interest. For instance, low values of the 3-month index means dry soil, whereas low values of the 12-month index could be consistent with either wet or dry soils, but may mean empty reservoirs and dry streams .

Plotting SPI as a function of both time and time-scale can give insight into nature of droughts at a particular location. The graphic highlights 20th century droughts (actually 1901-2012) for some agriculturally important places, extracted from the GPCP dataset.

The graphic includes some truly terrible historical events. For example, the Ukraine and Volga drought of 1920 and extensive period of drought in the Volga during the 1930s cost millions of lives.

The Russian droughts of 1935-39 coincided with the dust bowl years visible in the index data for Kansas. The 1950s drought in Kansas was at least as severe but it’s impacts were limited by improved agricultural practices and disaster response. The recent drought in Texas and Chihuahua province of Northern Mexico are the most extreme there for at least 50 years. The 2008 drought in Buenos Aires province of Argentina (which helped fuel food-price inflation globally) is without precedent in the GPCP record.

While droughts are an inevitable fact of life, it is also possible that the agricultural revolution which began in the 1960s took place during an unusually benign period. I hope to look into this in a future post.

[1] Schneider, Udo; Becker, Andreas; Finger, Peter; Meyer-Christoffer, Anja; Rudolf, Bruno; Ziese, Markus (2011): GPCC Full Data Reanalysis Version 6.0 at 0.5°: Monthly Land-Surface Precipitation from Rain-Gauges built on GTS-based and Historic Data.

SPI algorithm

There are many alternative ways to define a standardized precipitation index. In the approach used above , no functional form for the distribution of rainfall was assumed. SPI was computed based on the empirical cumulative distribution function( ecdf) only. ecdf can be defined from low  to high values values of precipitation P(>x) or from high to low values P(<x). The algorithm (R function)  given below symmetrises over these two alternatives. I think this is an improvement over the algorithm given in a previous post.

getSPI <- function(y){
 #empirical, symmetrized Standard Precipitation Index
 fit.cdf <- ecdf(y)
 fit.rcdf <- ecdf(-y)
 cdfs <- fit.cdf(y)
 rcdfs <- 1 -fit.rcdf(-y)
 #invert normal
 spi.t <- qnorm(cdfs)
 spi.tp <- na.omit(spi.t[ spi.t != Inf ]) #drop Inf
 #reversed
 rspi.t <- qnorm(rcdfs)
 rspi.tp <- na.omit(rspi.t[ rspi.t != -Inf ]) #drop Inf
 #symmetrise
 spi.sym <- (spi.t+rspi.t)/2
 spi.sym[which(spi.t == Inf)] <- rspi.t[which(spi.t==Inf)]
 spi.sym[which(rspi.t == -Inf)] <- spi.t[which(rspi.t==-Inf)]
 return(spi.sym)
}

April 11, 2013 · joe · No Comments
Posted in: Uncategorized