Random forest or gradient boosting?

Random forest and gradient boosting are leading data mining techniques. Both are designed to improve upon the poor predictive accuracy of decision trees. Random forest is by far the more popular of the two, if the Google Trends chart below is anything to go by.

[Google Trends chart: “random forest” vs. “gradient boosting”]

Correlation between predictors is the data miners’ bugbear. It is an inevitable fact of life in many situations. Multicollinearity can lead to misleading conclusions and degrade predictive power. A natural question is: Which approach handles multicollinearity better? Random forest or gradient boosting?

Suppose there are $n$ observations $\{y\}$ and $p$ potential predictors $\{x_1, \ldots, x_p\}$. Assume that

(A)   \[ y = x_1 + x_2 + \sigma \mathcal{N} \]

where $\sigma$ is the amplitude of Gaussian noise $\mathcal{N}$ (mean zero and unit variance). Only two of the $p$ potential predictors actually play a role in generating the observations. The $\{x_1, \ldots, x_p\}$ are independently distributed $\mathcal{N}$, with the exception of $x_3$, which is correlated with $x_1$ (correlation $\rho$):

(B)   \[ x_3 = \rho x_1 + \sqrt{1 - \rho^2}\, \mathcal{N} \]

As the correlation $\rho$ increases, it becomes harder for a data mining algorithm to ignore $x_3$, even though $x_3$ does not appear in (A) and is not a “true” explanatory variable.
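In R, this setup might look like the following sketch; the values of n, p, rho and sigma are illustrative choices, not taken from the charts below.

# Simulate the data-generating process in (A) and (B)
set.seed(1)
n     <- 1000    # observations
p     <- 10      # potential predictors
rho   <- 0.9     # correlation between x1 and x3
sigma <- 0.5     # noise amplitude

x <- matrix(rnorm(n * p), n, p)                      # independent N(0, 1) predictors
x[, 3] <- rho * x[, 1] + sqrt(1 - rho^2) * rnorm(n)  # (B): x3 correlated with x1
y <- x[, 1] + x[, 2] + sigma * rnorm(n)              # (A): only x1 and x2 matter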

[Variable importance charts: random forest vs. gradient boosting]
Variable importance charts for this class of problem show that gradient boosting does a better job of handling multicollinearity than random forest. The complex trees used by random forest tend to spread variable importance more widely, particularly to variables which are correlated with the “true” predictors. The simpler base learner trees of gradient boosting (4 terminal nodes in the above example) seem to have greater immunity from the evils of multicollinearity.

Random forest is an excellent data mining technique, but its greater popularity compared to gradient boosting seems unjustified.

R code
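The original listing has not survived, so the sketch below reconstructs the comparison under the assumptions of the simulation snippet above, using the randomForest and gbm packages; the hyperparameters are illustrative, not recovered from the post.

library(randomForest)
library(gbm)

dat <- data.frame(y = y, x)   # predictors named X1 ... X10

# Random forest: fully grown trees, permutation importance
rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)
importance(rf, type = 1)      # %IncMSE for each predictor

# Gradient boosting: shallow base learners; interaction.depth = 3
# gives trees with 4 terminal nodes, as described above
gb <- gbm(y ~ ., data = dat, distribution = "gaussian",
          n.trees = 500, interaction.depth = 3, shrinkage = 0.05)
summary(gb, n.trees = 500)    # relative influence for each predictor

If the charts above are reproduced, the random forest importances should spread toward x3 as rho grows, while the gbm relative influence stays concentrated on x1 and x2.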


One Comment

  • Hey,
    Great job, it really helped me.
    For me it works best to choose variables with GBM and then build a random forest on the variables selected by GBM. Overfitting drops considerably and predictions on test sets improve.
    What do you think about this procedure?

    Regards,
    Piotr
