# Overview of statistical tests

Here is an overview table of statistical tests that I use when consulting medical doctors. It is easy to use and easy to understand, and I usually hand it over to them as well. (PDF)

Statistician, Data Scientist, Football Player, Alpinist

In this article I present a benchmark analysis of different implementations of support vector machines in R.
The three packages that I compare are the most popular package **e1071**, the also well-known package **kernlab**, and the less well-known package **liquidSVM**.

The support vector machine (SVM) takes a very different approach to supervised learning than decision trees. In this article I describe the different hyperparameters of the SVM.

Random forest is one of the standard approaches for supervised learning nowadays. One of its advantages is that it does not require tuning of the hyperparameters to perform well. But is that really true? Maybe we are not only interested in a good model but in the best model we could get…

Tuning requires a lot of time and computational effort and is still difficult for beginners, as there is no standardized procedure and there are only a few packages for tuning. With my new package **tuneRanger** I try to fill this gap for the random forest algorithm in R.

I just updated the code from a previous post in which I analysed the download statistics of different random forest packages in R; see the code at the bottom of the article. I calculated the number of CRAN downloads in March 2016 and March 2017.

The number of downloads of different packages containing the random forest algorithm in March 2016 and in March 2017:

package | March 2016 | March 2017 | ratio 2017/2016 |
---|---|---|---|
randomForest | 29360 | 55415 | 1.89 |
xgboost | 4929 | 12629 | 2.56 |
randomForestSRC | 2559 | 6106 | 2.39 |
ranger | 1482 | 5622 | 3.79 |
Rborist | 298 | 441 | 1.48 |

What is clearly visible is the general increase in downloads of all packages that contain the standard random
forest. The biggest gain in popularity was achieved by the **ranger** package, which allows running the random
forest in parallel on a single machine. **xgboost** and **randomForestSRC** also saw a larger relative increase than
the standard **randomForest** package.
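The ratios in the table can be recomputed from the raw counts. The following is a minimal sketch in base R, not the download script from the original post; the data frame and its column names (`downloads`, `march_2016`, `march_2017`) are chosen here for illustration only.

```r
# Download counts copied from the table above.
downloads <- data.frame(
  package    = c("randomForest", "xgboost", "randomForestSRC", "ranger", "Rborist"),
  march_2016 = c(29360, 4929, 2559, 1482, 298),
  march_2017 = c(55415, 12629, 6106, 5622, 441)
)

# Relative growth: downloads in March 2017 divided by downloads in March 2016.
downloads$ratio <- round(downloads$march_2017 / downloads$march_2016, 2)

# Sort so that the package with the largest relative gain comes first.
downloads <- downloads[order(downloads$ratio, decreasing = TRUE), ]
print(downloads)
```

Sorting by the ratio rather than by the absolute counts makes the relative gain of the smaller packages visible, which the raw numbers tend to hide.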

There are already some benchmarking studies about different classification algorithms out there. Probably the best known and
most extensive one is the
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
paper. The authors use different software and also different tuning processes to compare 179 learners on 121 datasets, mainly
from the UCI site. They exclude several datasets because their dimension
(number of observations or number of features) is too high, because they are not in a proper format, or for other reasons.
Some criticism regarding the representativeness of the datasets and the generalizability of benchmarking results is also summarized there.
It remains a bit unclear whether their tuning process is also done on the test data or only on the training data (page 3154).
They report the random forest algorithms to be the best ones (in general) for multiclass classification datasets and
the support vector machine (SVM) to be the second best. On binary classification tasks neural networks also perform
competitively. They recommend the R library **caret** for choosing a classifier.

Other benchmarking studies use far fewer datasets and are much less extensive (e.g. the Caruana paper). Computational power was also not the same in those days.

In my first approach to benchmarking different learners I follow a more standardized procedure that can easily be
repeated in the future when new learners or datasets are added to the analysis.
I use the R package **OpenML** to get access to OpenML datasets and the R package **mlr** (similar to caret, but more extensive) to have a standardized interface to machine learning algorithms in R.
Furthermore, the experiments are run with the help of the package **batchtools**
in order to parallelize them (installation via the **devtools** package: `devtools::install_github("mllg/batchtools")`).

The open data platform OpenML offers a lot of datasets for evaluating algorithms. These datasets can be used for classification and regression. The platform is accessible via the web page or via several programming languages such as Java, Python and R.

I am performing a benchmark study with datasets from OpenML and classification and regression algorithms from the **mlr** package, which provides a standardized interface to many of the algorithms implemented in R and easy evaluation with resampling strategies like cross-validation.

The comparison can be found in this GitHub repository: benchmark-mlr-openml.

I will soon post a summary about the results.

Another comparison, restricted to random forest and logistic regression, was done by Raphael: BenchmarkOpenMl.

As mentioned in the previous post, I will write a bit about the random feature selection in random forests. In the training step, at each split in a random forest, k features are selected at random from all features. For each of these features the best split according to a split criterion is determined, and the feature whose split performs best among these k candidates is chosen as the splitting feature. The number k should not be set too high, so that the same features are not always chosen, but also not too low, so that at least some relevant features enter the comparison. The right value also depends highly on the dataset. If there are only a few relevant features in the dataset, k should be set sufficiently high, and vice versa.

The introduction of randomness in the feature selection process seems to be advantageous in many cases. The default in
many packages like **randomForest** is to choose k as the square root of the number of features (for classification).
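The selection step at a single split can be sketched in base R. This is only an illustration under stated assumptions, not the internals of **randomForest** or any other package: the split criterion is assumed to be the Gini impurity, all features are assumed to be numeric, and the function names (`gini`, `best_split_for_feature`, `random_split`) are hypothetical.

```r
# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Best threshold for one numeric feature: lowest weighted Gini impurity
# of the two child nodes over all candidate cut points.
best_split_for_feature <- function(x, y) {
  values <- sort(unique(x))
  if (length(values) < 2) {
    return(list(threshold = NA, impurity = Inf))
  }
  cuts <- (head(values, -1) + tail(values, -1)) / 2  # midpoints between values
  impurity <- sapply(cuts, function(cut) {
    left <- y[x <= cut]
    right <- y[x > cut]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)
  })
  best <- which.min(impurity)
  list(threshold = cuts[best], impurity = impurity[best])
}

# One split of a tree in a random forest: draw k candidate features at random,
# then keep the candidate whose best split has the lowest impurity.
random_split <- function(X, y, k = floor(sqrt(ncol(X)))) {
  candidates <- sample(colnames(X), k)
  splits <- lapply(candidates, function(f) best_split_for_feature(X[[f]], y))
  impurities <- sapply(splits, function(s) s$impurity)
  best <- which.min(impurities)
  list(candidates = candidates,
       feature = candidates[best],
       threshold = splits[[best]]$threshold)
}

set.seed(1)
X <- iris[, 1:4]
y <- iris$Species
split <- random_split(X, y)  # k = floor(sqrt(4)) = 2 candidate features
print(split)
```

Calling `random_split()` repeatedly without resetting the seed draws different candidate sets, which is exactly the source of the variation between the trees of a forest.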

Bagging is one of the core principles of random forests and is a highly advantageous modeling strategy. I will now describe how bagging can be done and some of its advantages and disadvantages. The underlying base models can be of any kind; in random forests, decision trees are the base models.

In bagging, several base models are trained, and each of them is trained on only a subset of the training observations. These models are also called bagged models.
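A minimal sketch of this idea in base R follows. It rests on a few assumptions that are not from the text above: the subsets are drawn as bootstrap samples (sampling with replacement), the base model is a linear regression fitted with `lm()` instead of a decision tree, the predictions are aggregated by averaging, and the function names `bagging_fit` and `bagging_predict` are hypothetical.

```r
# Train n_models base models, each on its own bootstrap sample of the rows.
bagging_fit <- function(formula, data, n_models = 25) {
  lapply(seq_len(n_models), function(i) {
    rows <- sample(nrow(data), replace = TRUE)  # bootstrap sample of row indices
    lm(formula, data = data[rows, ])
  })
}

# Aggregate by averaging the predictions of all bagged models.
bagging_predict <- function(models, newdata) {
  # One column of predictions per bagged model.
  preds <- do.call(cbind, lapply(models, predict, newdata = newdata))
  rowMeans(preds)
}

set.seed(42)
models <- bagging_fit(mpg ~ wt + hp, data = mtcars, n_models = 25)
pred <- bagging_predict(models, mtcars[1:5, ])
print(pred)
```

Since the base model is just a function that is fitted on a data frame, `lm()` could be swapped for a decision tree learner to get closer to what a random forest does.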

In my last post I provided a small list of some R packages for random forest. Today I will provide a more complete list of random forest R packages. In the first table I list the R packages that can perform the standard random forest as described in the original Breiman paper.

package | RStudio downloads in the last month |
---|---|
randomForest | 28353 |
xgboost | 4537 |
randomForestSRC | 2291 |
ranger | 1347 |
Rborist | 284 |

Random forests were formally introduced by Breiman in 2001. Due to their excellent performance and simple application, random forests are becoming an increasingly popular modeling strategy in many different research areas.

Random forests are suitable in many different modeling cases, such as classification, regression, survival time analysis, multivariate classification and regression, multilabel classification and quantile regression.

An overview of existing random forest implementations and their speed performance can be found in the ranger documentation, although this list is not exhaustive and many new implementations are coming up. The performance of models built with different packages differs slightly, depending on how the random forest algorithm was implemented.

Now I will present some random forest implementations in R. A good site for finding all R packages on a specific topic is Metacran.

Hello out there!

This is the first post on this blog.

I will post here regularly about topics related to tree-based methods in machine learning.