Bagging

Bagging is one of the core principles of random forests and is a highly advantegous modeling strategy. I will describe now how bagging can be done and some of its advantages and disadvantages. The underlying base models can be of any kind, in random forests decision trees are the base models.

Training phase

In bagging several base models are trained and for each model only a subset of the training observations is used to train each model, which are also called bagged models.

The subsets can be drawn in different ways:

For each model draw randomly observations with replacement
For each model draw randomly observations without replacement
Divide the training dataset into several subsets and train each model with a different subset

Furthermore:

The number of observations that is drawn for each model has to be specified.
And of course the number of models has to be specified; usually more models are better, but in special cases this can be wrong, see here for an example

Which of the strategies for drawing observations is best, is an open issue and can differ from dataset to dataset.

Prediction phase

In the prediction phase the predictions of the models are aggregated. This can be done in many different ways.

In classification problems the most common class can be predicted (majority vote) or the fraction of the models that voted for each class can be regarded. In trees and other modeling strategies that provide probabilty estimations, it is also possible to average the probability of the classes that where predicted (in trees the fraction of observations of a class in the reached leaf node of a tree).

In regression problems it can be aggregated by taking the mean of all models or the median or some other desired statistic. Also quantiles can be extracted. A cumulative distribution function can be estimated with these quantiles and thus also a density. Taking quantiles is a way to get more informations out of the aggregated models.

There are several advantages of bagging:

As different subsets of the observations are used for the models, the models emphasize on different aspects of the training dataset.
The variance of the prediction is reduced as several models with different subsets are trained and averaged. For more informations about this see the pages 282 to 288 of The Elements of Statistical Learning. This is also related to stacking, a related modeling strategy.
Out-of-bag predictions can be build. For each observation all the models where this observation was not in, can be used to predict this observation. This is usually provided by standard packages of random forests.
These out-of-bag predictions can be used for predicting some performance measures (like the mean missclassification error) of the complete model
- it is usually much faster than making a typical resampling strategy like k-fold cross validation
- it is biased because less trees are used for each observation than in the whole model. Usually more trees are better, so usually it is biased in the negative direction.
They also could be used for tuning hyperparameters as it is a much faster strategy than other resampling strategies. -> One drawback: The strategy to build the subsets for drawing the observations for the models can be difficult to tune, because they influence the prediction performance also in the out-of-bag prediction phase because for example higher or lower number of models are used for predicting each observation. This problem diminishes for a higher number of models as the performance measures usually converge with a growing number of models.

One disadvantage may be the more difficult interpretation of bagged models compared to single models. This can be handled with advanced interpretation strategies like partial prediction plots.

More randomness in random forests are introduced by the random draw of features in each split. This will be topic in one of the following posts.

Written on April 10, 2016