Modelling Australian crop yields

Model building is an art not a science. However a manual approach is impractical when the number of candidate models is large. Fortunately, simple but powerful statistical tools exist for automated model selection. This post illustrates the “brute force” approach to model selection using a real-world example.

Australian climate

Australia’s climate is strongly influenced by natural variability of sea temperatures in the tropical pacific El Nino/Southern Oscillation (ENSO).  It is known “warm” phase of ENSO (which brings cooler waters to North Eastern Australia) tends to produce drier conditions in Australia and visa versa. The chart shows the monthly Nino 3.4 index from the UK Met Office beginning in 1870.

 

Australian crop yields

Wheat yield data from 1861 (Australian Bureau of Statistics) are shown below. The top row shows yield values and fitted smooth trend lines for five producer states. The  bottom row shows departures from trend yield as a % of trend yield (or “returns”) . The influence of ENSO means that yields are unusually volatile e.g. New South Wales ~ 30%.  Western Australia yields are less volatile ~ 18%.

 

Model

The goal is to identify a good statistical model relating wheat yields to ENSO.  All of the 12 monthly ENSO indices of the agricultural year (July to June)  are potential predictor variables. The times-series of yield “returns” is approximately stationary and is taken as response variable. These data are combined into a dataframe wheat with the following structure:

              state year yield residual return   Jul   Aug   Sep   Oct   Nov   Dec  Jan   Feb   Mar   Apr   May   Jun
1        Queensland 1871  0.92    -0.21  -18.6 -0.92 -0.80 -0.48 -0.87 -0.73 -0.78 -0.2 -0.56 -0.38 -0.43 -0.62 -0.44
2   South Australia 1871  0.77     0.23   42.4 -0.92 -0.80 -0.48 -0.87 -0.73 -0.78 -0.2 -0.56 -0.38 -0.43 -0.62 -0.44
3   New South Wales 1871  0.45    -0.44  -49.3 -0.92 -0.80 -0.48 -0.87 -0.73 -0.78 -0.2 -0.56 -0.38 -0.43 -0.62 -0.44
4          Victoria 1871  0.68    -0.25  -26.9 -0.92 -0.80 -0.48 -0.87 -0.73 -0.78 -0.2 -0.56 -0.38 -0.43 -0.62 -0.44
5 Western Australia 1871  0.80    -0.05   -5.9 -0.92 -0.80 -0.48 -0.87 -0.73 -0.78 -0.2 -0.56 -0.38 -0.43 -0.62 -0.44
................

With the above structure, a list of regression models for each state is obtained using lm and dlply from R’s plyr package:

model0 <- dlply(wheat,.(state), lm, formula = return~0+Jul+Aug+Sep+Oct+Nov+Dec+Jan+Feb+Mar+Apr+May+Jun)

The structure of these models can be visualized as a heat map of t-values. t-values are just regression coefficients normalized by their standard errors. For example  t-values of magnitude less than 1 suggest a less statistically significant relationship, negative t-values indicate negative correlation etc. The t-value heat map for model0 is shown on the left panel (“All”) of the graph below. It is a fairly clear example of overfitting.

A search for more meaningful models involves choosing subsets of the monthly ENSO indices as predictors. This leads to a large number 212 – 1 = 4095 of potential models. Brian Ripley’s step() function automates the process of selecting the optimal model based on an “information” criterion such as Akaike Information Criterior (AIC). AIC penalizes models with more parameters.

lm.auto <- function(df){
lm.0 <- lm(return~0+Jul+Aug+Sep+Oct+Nov+Dec+Jan+Feb+Mar+Apr+May+Jun, data=df)
lm.s <- step(lm.0,k=2,trace=0)
return(lm.s)
}
model1 <- dlply(wheat,.(state), lm.auto)

t-values selected by lm.auto are shown in the centre panel. A harsher criterion (BIC, k=5) leads to the simplest models shown in the right-hand panel.

 

 

t-value heat maps suggest that December ENSO is a good predictor of wheat yields for New South Wales, Victoria and South Australia.  For Queensland, October ENSO is better. The situation in Western Australia is more complex…

Graphics created using ggplot2.

Note added 30 Aug 2012

A reader asked how the ENSO phase plot was made. Put your ENSO index data in long form with +ve index values in a nino column and -ve values in a nina column.

   
year month nino  nina
1871   Jul    0 -0.92
1872   Jul    0 -0.48
1873   Jul    0 -0.20
...

geom_ribbon then does the trick:
g <- ggplot(nino.long, aes(x = year)) g<- g + geom_ribbon(aes(ymin = nina, ymax = 0,fill="nina")) + geom_ribbon(aes(ymin = 0, ymax = nino,fill="nino")) g <- g + scale_fill_manual(name="",values=c("#4393C3","#D6604D")) g <- g+ facet_grid(.~month) g+ opts(title="ENSO 1871-2012")+coord_flip()+ylab("Nino 3.4 Index")