Tuning random forest

Random forest is one of the standard approaches for supervised learning nowadays. One of its advantages is that it does not require tuning of the hyperparameters to perform well. But is that really true? Maybe we are not only interested in a good model but in the best model we could get…

Tuning requires a lot of time and computational effort and is still difficult for beginners, as there is no standardized approach and there are only a few packages for tuning. With my new package tuneRanger I try to fill this gap for the random forest algorithm in R.


The R package can be installed via devtools::install_github("PhilippPro/tuneRanger")


Here is a short piece of R code that shows how it works. We also need the mlr package to run it. First an mlr task has to be created via makeClassifTask or makeRegrTask. After that the runtime of the tuning can be estimated with estimateTuneRFTime.

library(mlr)
library(tuneRanger)

# Create an mlr task from the iris dataset
iris.task = makeClassifTask(data = iris, target = "Species")
# Rough estimate of the tuning time
# (newer tuneRanger releases export this function as estimateTimeTuneRanger)
estimateTuneRFTime(iris.task)
# Tuning process (takes around 1 minute); the tuning measure is the multiclass Brier score
res = tuneRanger(iris.task, measure = list(multiclass.brier))

The tuning itself is run with the tuneRanger function. The task has to be passed, together with the measure that should be optimized. The available measures are listed on the mlr tutorial page. Typical measures for classification are the mean misclassification rate (also called the error rate), the AUC, the Brier score and the logarithmic loss; for regression, the typical measure is the mean squared error.
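As a small sketch of how a measure is chosen, assuming mlr and tuneRanger are installed: listMeasures from mlr returns the names of the measures that fit a task, and a regression task is tuned in the same way as a classification task. The mtcars dataset and the object names are only illustrative choices, not part of the benchmark.

```r
library(mlr)
library(tuneRanger)

# Classification: names of the measures mlr offers for this task
iris.task = makeClassifTask(data = iris, target = "Species")
listMeasures(iris.task)

# Regression: mtcars is used here purely as an illustration
mtcars.task = makeRegrTask(data = mtcars, target = "mpg")
listMeasures(mtcars.task)

# Tune the mean squared error (mlr measure object `mse`)
res.regr = tuneRanger(mtcars.task, measure = list(mse))
res.regr
```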

All other parameters have sensible defaults and do not have to be changed. iters specifies the number of iterations (in each iteration one random forest is trained and evaluated); the default of 100 gives good results in general. The number of threads, that is, how many CPUs are used, can be set via num.threads, and the number of trees via num.trees. Other parameters of the underlying ranger package can be fixed via the parameters argument, and the set of tuned hyperparameters can be chosen via tune.parameters.
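The optional arguments could be combined as in the following sketch. The concrete values (500 trees, 2 threads, tuning only two of the three hyperparameters, sampling with replacement) are arbitrary illustrations rather than recommendations, and the argument names are those of the tuneRanger function as documented for the package; older versions may differ.

```r
library(mlr)
library(tuneRanger)

iris.task = makeClassifTask(data = iris, target = "Species")

res = tuneRanger(
  iris.task,
  measure = list(multiclass.brier),
  num.trees = 500,                               # trees per forest
  num.threads = 2,                               # CPU threads used by ranger
  tune.parameters = c("mtry", "min.node.size"),  # tune only these two
  parameters = list(replace = TRUE)              # fixed ranger parameter
)
res
```

Since sample.fraction is left out of tune.parameters here, it is not tuned and the forest uses the default of the underlying learner.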

The outcome is a recommendation vector for the hyperparameters and a model with the tuned hyperparameters.
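A minimal sketch of how the result might be inspected, assuming the returned object exposes the elements recommended.pars and model as in the package documentation. Predicting on the first rows of the training data is done only to show the call, not as a valid evaluation.

```r
library(mlr)
library(tuneRanger)

iris.task = makeClassifTask(data = iris, target = "Species")
res = tuneRanger(iris.task, measure = list(multiclass.brier))

# Recommended hyperparameter values
res$recommended.pars

# mlr model fitted with the recommended hyperparameters
res$model

# Predictions from the tuned model (illustration only)
pred = predict(res$model, newdata = iris[1:5, ])
as.data.frame(pred)
```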

How it works

The current algorithm works as follows:

  • Sequential model-based optimization is used as the tuning strategy, with 30 randomly drawn points evaluated for the initial design and 70 iterative steps in the optimization procedure. mlrMBO is used internally for tuning.
  • The three parameters min.node.size, sample.fraction and mtry are tuned at once.
  • Out-of-bag predictions are used for evaluation, which makes it much faster than other packages and tuning strategies that use for example 5-fold cross-validation.
  • Classification as well as regression is supported.
  • The default measure that is optimized is the Brier score for classification and the mean squared error for regression.
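To make the out-of-bag idea concrete, here is a simplified sketch that calls ranger directly. It is not the tuneRanger implementation: it replaces the sequential model-based optimization with a plain random search over 10 points, and the helper brier, the search ranges and the number of trees are assumptions chosen for illustration. The key point is that each configuration needs only one forest, because ranger stores out-of-bag predictions in the predictions element of the fitted object.

```r
library(ranger)

set.seed(1)
p = ncol(iris) - 1  # number of features

# Multiclass Brier score from a matrix of class probabilities
# (hypothetical helper; columns of prob must be named by class)
brier = function(prob, truth) {
  onehot = 1 * outer(as.character(truth), colnames(prob), "==")
  mean(rowSums((prob - onehot)^2), na.rm = TRUE)
}

# Random design over the three hyperparameters (ranges are assumptions)
design = data.frame(
  mtry = sample(1:p, 10, replace = TRUE),
  min.node.size = sample(1:20, 10, replace = TRUE),
  sample.fraction = runif(10, 0.2, 0.9)
)

# One forest per configuration, evaluated on its out-of-bag predictions
design$brier = sapply(seq_len(nrow(design)), function(i) {
  fit = ranger(Species ~ ., data = iris, num.trees = 200,
    probability = TRUE, replace = FALSE,
    mtry = design$mtry[i],
    min.node.size = design$min.node.size[i],
    sample.fraction = design$sample.fraction[i])
  brier(fit$predictions[, levels(iris$Species)], iris$Species)
})

# Best configuration found by this random search
design[which.min(design$brier), ]
```

With 5-fold cross-validation, the same loop would have to train five forests per configuration instead of one.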

Benchmark study

In a benchmark study I compared the algorithm with some other tuning implementations for random forests in R and with a standard random forest without tuning. The competitors are the following:

  • Different target performance measures for tuneRanger; the default is to tune the Brier score for classification and the mean squared error for regression. We also look at the versions that tune the AUC and the logarithmic loss in the case of classification.
  • Two packages that already perform tuning for random forests:
    • mlrHyperopt, which also uses mlrMBO in the background and has predefined tuning parameters and tuning spaces for many supervised learning algorithms. We use the default.
    • The caret implementation of ranger, which automatically tunes the mtry parameter.
  • The standard random forest algorithm (from the ranger package), to see whether we get better results than the default algorithm.

Our benchmark study is conducted on several datasets from OpenML. We use the OpenML100 benchmarking suite and download it via the OpenML R package. For classification we only use datasets that have a binary target and no missing values. We classify the datasets as small, medium or big by running tuneRanger on 10 cores and recording the runtime. A dataset is classified as small if the runtime is less than 60 seconds, as medium if it is between one and ten minutes, and as big if it is more than ten minutes.
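The size classification can be written down as a small helper. The function name classify_size is hypothetical, and assigning a runtime of exactly 60 seconds to medium and exactly 600 seconds to big is an assumption, since the boundaries are not spelled out above.

```r
# Classify datasets by the tuneRanger runtime in seconds
# (hypothetical helper; boundary handling is an assumption)
classify_size = function(runtime_sec) {
  cut(runtime_sec,
    breaks = c(0, 60, 600, Inf),
    labels = c("small", "medium", "big"),
    right = FALSE)  # intervals are [0, 60), [60, 600), [600, Inf)
}

classify_size(c(12, 60, 599, 600, 4000))
```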

For the small and medium datasets we perform a 5-fold cross-validation and repeat it 10 times. The average results for these 30 datasets can be seen in the table below. The suffix after tuneRanger indicates which measure was tuned.

                   Error rate  (Multiclass) AUC  Brier Score  Logarithmic Loss  Training Runtime
tuneRangerMMCE         0.0988            0.9060       0.1445            0.2464          193.5932
tuneRangerAUC          0.0991            0.9088       0.1456            0.2483          187.7843
tuneRangerBrier        0.0991            0.9069       0.1398            0.2351          183.6576
tuneRangerLogloss      0.0995            0.9073       0.1398            0.2338          178.1290
hyperopt               0.0979            0.9064       0.1440            0.2484          317.3986
caret                  0.1039            0.9064       0.1515            0.2548          168.3151
ranger                 0.1074            0.9041       0.1632            0.2747            3.9578
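The evaluation protocol behind this table (5-fold cross-validation repeated 10 times) can be sketched with mlr for the untuned ranger baseline. The sonar.task dataset shipped with mlr is only a stand-in for one of the binary OpenML datasets, and the list of measures is my reading of the table columns, so treat both as assumptions.

```r
library(mlr)

# Binary example task shipped with mlr (stand-in for an OpenML dataset)
task = sonar.task

# 5-fold cross-validation, repeated 10 times
rdesc = makeResampleDesc("RepCV", folds = 5, reps = 10)

# Untuned random forest with probability predictions
lrn = makeLearner("classif.ranger", predict.type = "prob")

# Error rate, AUC, Brier score, logarithmic loss and training time
r = resample(lrn, task, rdesc,
  measures = list(mmce, auc, brier, logloss, timetrain))

# Values aggregated over all 50 train/test splits
r$aggr
```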

The average rank over all datasets can be seen in the following table:

                   Error rate  (Multiclass) AUC  Brier Score  Logarithmic Loss  Training Runtime
tuneRangerMMCE           4.28              3.43         4.53              4.48              5.40
tuneRangerAUC            3.98              5.47         4.50              4.30              4.73
tuneRangerBrier          2.92              4.53         1.60              2.40              4.97
tuneRangerLogloss        3.33              4.67         2.17              2.20              4.17
hyperopt                 3.18              3.27         4.33              5.05              4.90
caret                    4.93              3.42         5.55              4.85              2.83
ranger                   5.37              3.22         5.32              4.72              1.00
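For readers unfamiliar with average ranks, here is a sketch of how such a number can be computed for one measure: the methods are ranked within each dataset and the ranks are then averaged per method. The error rates and the method names A, B and C below are made up for illustration and are not benchmark results; rank 1 going to the smallest value is also an assumption about the convention.

```r
# Hypothetical error rates: rows = datasets, columns = methods
err = rbind(
  d1 = c(A = 0.10, B = 0.12, C = 0.11),
  d2 = c(A = 0.20, B = 0.18, C = 0.25),
  d3 = c(A = 0.05, B = 0.07, C = 0.06)
)

# Rank the methods within each dataset (1 = smallest error rate)
ranks = t(apply(err, 1, rank))

# Average rank of each method over all datasets
colMeans(ranks)
```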

We see that on average the tuneRanger methods outperform the ranger package and the caret package on almost all measures; the one exception is the AUC of tuneRangerMMCE (0.9060), which is marginally below that of caret (0.9064). Tuning a specific measure also gives, on average, the best result for that measure among all classifiers. This also holds for the error rate, where mlrHyperopt performs best, because mlrHyperopt tunes the error rate internally. It is only partly true if we look at the ranks: tuneRangerBrier has the best average rank for the error rate (mmce), not tuneRangerMMCE.

caret is on average better than ranger, but is clearly outperformed by the tuneRanger methods, showing that tuning only the mtry parameter is not enough.

mlrHyperopt is quite competitive. This is not surprising, as it also uses mlrMBO for tuning, like tuneRanger. Its main disadvantage is the runtime. On average it takes longer, because it evaluates each configuration with 5-fold cross-validation instead of the out-of-bag predictions that tuneRanger uses, and 5-fold cross-validation takes roughly five times longer. This hardly matters for smaller datasets, where the tuning finishes in less than ten minutes, but it plays an important role for bigger datasets.

Below we see a graph that shows the average runtime for the algorithms.


We see that for smaller datasets the runtime of mlrHyperopt is shorter than that of tuneRanger, but as the runtime increases it gets worse and worse compared with the tuneRanger algorithm. Because of this I think that tuneRanger is preferable especially for bigger datasets, where runtime also plays a more important role.

In a follow-up blog post I will present results for bigger datasets.

Written on November 23, 2017