Archive for the ‘Statistics’ Category

Murphy-Epstein’s Law

Predicting the value of a continuous real variable from historical observations, covariates, etc. is a routine problem. It has never been easier to build sophisticated statistical models from data. Sadly, however, it often turns out that the predictions of the fancy model are not much better than a simple mean of the historical observations.

The output of the best predictive models (as determined by cross-validation, for example) always shows less variance than the observations. This phenomenon is called shrinkage.

Shrinkage can be understood from an identity known in weather forecasting as the Murphy-Epstein decomposition[*].

    \[\text{forecast skill} = \rho^2 - \left( \rho - {\sigma_f \over \sigma_o} \right)^2  \]

Here \rho is the correlation between forecasts and observations, and \sigma_o and \sigma_f are the standard deviations of the observations and forecasts respectively.

To maximise skill, the second term must be made as small as possible; it vanishes when \sigma_f / \sigma_o = \rho, leaving a skill of \rho^2. In particular, \rho \ll 1 requires \sigma_f \ll \sigma_o.

Having low variance compared to the observations may seem strange: it makes your predictive model look like a less realistic description of reality. Yet shrinkage is a feature of any imperfect (\rho < 1) but optimised predictive model.
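This can be checked numerically. Below is a minimal Python sketch (not the post's R code) of an imperfect but optimised model: observations are a signal plus noise, and the least-squares forecast built from the signal ends up with a standard deviation of roughly \rho times that of the observations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
signal = rng.normal(size=n)            # the predictable part
obs = signal + rng.normal(size=n)      # observations = signal + independent noise

# Least-squares (optimised) forecast of obs from signal; rho < 1 here
beta = np.cov(signal, obs)[0, 1] / np.var(signal)
forecast = beta * signal

rho = np.corrcoef(forecast, obs)[0, 1]
ratio = np.std(forecast) / np.std(obs)
print(rho, ratio)  # ratio is close to rho, i.e. the forecast is shrunk
```

With this setup \rho is about 0.71, and the ratio \sigma_f / \sigma_o comes out almost exactly equal to it, as the decomposition predicts for a skill-maximising model.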


random forest or gradient boosting?

Random forest and gradient boosting are leading data mining techniques. Both are designed to improve on the poor predictive accuracy of decision trees. Random forest is by far the more popular, if the Google Trends chart below is anything to go by.

[trends h="450" w="500" q="+random+forest,+gradient+boosting"]

Correlation between predictors is the data miners' bugbear, and an inevitable fact of life in many situations. Multicollinearity can lead to misleading conclusions and degrade predictive power. A natural question is: which handles multicollinearity better, random forest or gradient boosting?

Suppose there are  n observations  \left\{ y \right\} and potential predictors  \left\{x_1 \cdots x_p \right\}. Assume that

(A)   \[ {y = x_1 +x_2 + \sigma\mathcal{N}} \]

where \sigma is the amplitude of Gaussian noise \mathcal{N} (mean zero and unit variance). Only 2 of the p potential predictors actually play a role in generating the observations. The \left\{x_1 \cdots x_p \right\} are independently distributed (\mathcal{N}), with the exception of x_3, which is correlated with x_1 (correlation \rho):

(B)   \[ x_3 =  \rho x_1 + \sqrt{1-\rho^2} \mathcal{N} \]

As the correlation \rho increases, it becomes harder for a data mining algorithm to ignore x_3, even though x_3 does not appear in (A) and is not a “true” explanatory variable.
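The data-generating model (A)–(B) is easy to simulate. Here is a Python sketch (the post's own code is in R; the sample size and parameter values below are illustrative choices, not taken from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma, rho = 1000, 10, 0.5, 0.9

# p independent standard-normal predictors x_1 ... x_p (0-indexed columns)
X = rng.normal(size=(n, p))
# (B): make x_3 correlated with x_1, correlation rho
X[:, 2] = rho * X[:, 0] + np.sqrt(1 - rho**2) * rng.normal(size=n)
# (A): only x_1 and x_2 generate the observations
y = X[:, 0] + X[:, 1] + sigma * rng.normal(size=n)

corr = np.corrcoef(X[:, 0], X[:, 2])[0, 1]
print(corr)  # close to rho
```

The \sqrt{1-\rho^2} factor in (B) keeps x_3 at unit variance, so the sample correlation between x_1 and x_3 recovers \rho.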



Variable importance charts for this class of problem show that gradient boosting does a better job of handling multicollinearity than random forest. The complex trees used by random forest tend to spread variable importance more widely, particularly onto variables that are correlated with the “true” predictors. The simpler base-learner trees of gradient boosting (4 terminal nodes in the example above) seem to have greater immunity to the evils of multicollinearity.

Random forest is an excellent data mining technique, but its greater popularity compared to boosting seems unjustified.

R code