Data Visualisation with Markdown, Flexdashboard and Shiny

In most data projects it is useful and necessary to visualize your data and your results. Different tools exist for this in the R universe, and which one suits you best depends on your purpose.

In the following I will present some data visualization packages and tools that can help you to get the best visualization out of your data.

Shiny: Fast Data Loading with fst

I had several projects where I had to load a big dataset for my Shiny app. This loading was usually done at the beginning and would take more than 3 minutes. My target was to reduce this time. I started thinking about the problem and discovered that the whole dataset is not required when the app starts.

Guidelines for writing good R code

These guidelines are recommendations and are not meant to be obligatory. Many of the principles are useful and help you work and collaborate more efficiently with R. Feel free to add your recommendations or remarks in the discussion section below.

Machine Learning Strategy (Part 3)

In this third and last blog post about machine learning strategy I will talk about the problems that arise when the training, development and test sets come from different distributions, and about learning from multiple tasks.

Machine Learning Strategy (Part 2)

This is the second blog post about machine learning strategy. It is about human-level performance, bias and variance (tradeoff) and how to improve your algorithm iteratively.

Machine Learning Strategy (Part 1)

Machine Learning Strategy is about how to tackle machine learning tasks strategically. In three blog posts I will try to give an introduction to this topic, and I also hope for some comments and opinions on it.

Is catboost the best gradient boosting R package?

Several R packages implement gradient boosting, each using different methods. The three most famous ones are currently xgboost, catboost and lightgbm. I want to compare these three to find out which one performs best in its default mode without tuning. These algorithms are not pure gradient boosting algorithms but combine it with other useful methods such as bagging, which is used, for example, in random forest.

New xgboost defaults

xgboost is the most famous R package for gradient boosting and has been available for a long time. In one of my publications, I created a framework for providing defaults (and tunability measures), and one of the packages that I used there was xgboost. The results provided a default with the parameter nrounds=4168, which leads to long runtimes.

Hence, I wanted to use the data from the paper to fix nrounds at 500 and optimize the other parameters to get optimal defaults.

Overview of statistical tests

Here is an overview table of statistical tests that I use when consulting medical doctors. It is easy to use and easy to understand. I usually also hand it over to them. (PDF)

mlr vs. caret

Let’s compare the two popular R packages for machine learning, mlr and caret.

caret has been around longer: its first CRAN release seems to be from 2007, while mlr came to CRAN in 2013. For now, caret seems to be more popular. According to cranlogs, caret was downloaded 178029 times in the last 30 days, while mlr was downloaded 11408 times in the last 28 days.

Implementations and defaults of the Support Vector Machine in R

In this article I will present a benchmark analysis of different implementations of support vector machines in R. The three packages that I will compare are the most popular package e1071, the also well-known package kernlab and the less well-known package liquidSVM.

Hyperparameters of the Support Vector Machine

The support vector machine (SVM) is a very different approach to supervised learning than decision trees. In this article I will try to explain the different hyperparameters of the SVM.

Tuning random forest

Random forest is one of the standard approaches for supervised learning nowadays. One of its advantages is that it does not require tuning of the hyperparameters to perform well. But is that really true? Maybe we are not only interested in a good model but in the best model we could get…

Tuning requires a lot of time and computational effort and is still difficult for beginners, as there are no standardized ways and only a few packages for tuning. With my new package tuneRanger I try to fill this gap for the random forest algorithm in R.

Update on Random Forest Package Downloads

I just updated the code from a previous post where I analysed the download statistics of different random forest packages in R; see the code at the bottom of the article. I calculated the number of CRAN downloads in March 2016 and March 2017.

Standard random forest

The number of downloads of different packages containing the random forest algorithm in March 2016 and in March 2017:

package	March 2016	March 2017	ratio 2017/2016
randomForest	29360	55415	1.89
xgboost	4929	12629	2.56
randomForestSRC	2559	6106	2.39
ranger	1482	5622	3.79
Rborist	298	441	1.48

What is clearly visible is the general increase in downloads of all packages that contain the standard random forest. The biggest gain in popularity was achieved by the ranger package, which allows you to run random forest in parallel on a simple machine. xgboost and randomForestSRC saw a bigger increase than the standard randomForest package.
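The ratio column can be checked with a few lines of arithmetic. The following is only a minimal sketch in Python (the post itself works in R); the counts are copied from the table above, and the names `downloads` and `growth_ratio` are made up for this illustration.

```python
# Download counts copied from the table above: (March 2016, March 2017).
downloads = {
    "randomForest": (29360, 55415),
    "xgboost": (4929, 12629),
    "randomForestSRC": (2559, 6106),
    "ranger": (1482, 5622),
    "Rborist": (298, 441),
}


def growth_ratio(count_2016, count_2017):
    """Ratio of 2017 downloads to 2016 downloads, rounded to two decimals."""
    return round(count_2017 / count_2016, 2)


# Ratio per package, and the package with the largest relative increase.
ratios = {pkg: growth_ratio(*counts) for pkg, counts in downloads.items()}
fastest_growing = max(ratios, key=ratios.get)
```

Here `fastest_growing` is the package with the largest ratio, which matches the observation about ranger in the text.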

Benchmark results (mlr-learners on OpenML)

There are already some benchmarking studies about different classification algorithms out there. Probably the best-known and most extensive one is the paper Do we Need Hundreds of Classifiers to Solve Real World Classification Problems? The authors use different software and different tuning processes to compare 179 learners on more than 121 datasets, mainly from the UCI site. They exclude some datasets because their dimension (number of observations or number of features) is too high, because they are not in a proper format, or for other reasons. Some criticism about the representativeness of the datasets and the generalizability of the benchmarking results has also been summarized elsewhere. It remains a bit unclear whether their tuning process also uses the test data or only the training data (page 3154). They report random forest to be the best algorithm (in general) for multiclass classification datasets and the support vector machine (SVM) to be the second best. On binary classification tasks neural networks also perform competitively. They recommend the R library caret for choosing a classifier.

Other benchmarking studies use far fewer datasets and are much less extensive (e.g. the Caruana paper). Computational power was also not the same in those days.

In my first attempt at benchmarking different learners I follow a more standardized approach that can easily be repeated in the future when new learners or datasets are added to the analysis. I use the R package OpenML to access OpenML datasets and the R package mlr (similar to caret, but more extensive) to have a standardized interface to machine learning algorithms in R. Furthermore, the experiments are run with the help of the package batchtools in order to parallelize them (installation via the devtools package: devtools::install_github("mllg/batchtools")).

Benchmarking algorithms with OpenML and R-package mlr

The open data platform OpenML offers a lot of datasets for evaluating algorithms. The datasets can be used for classification and regression. The platform is accessible via the web page or via several programming languages such as Java, Python and R.

I am performing a benchmark study with datasets from OpenML and classification and regression algorithms from the mlr package, which provides a standardized interface to many of the algorithms implemented in R and easy evaluation with resampling strategies like cross-validation.

The comparison can be found in this GitHub repository: benchmark-mlr-openml.

I will soon post a summary about the results.

Another comparison just between random forest and logistic regression is done by Raphael: BenchmarkOpenMl.

Random feature selection

As mentioned in the previous post, I will write a bit about the random feature selection in random forest. In the training step, at each split in a random forest, k features are selected at random from all features. For each of these features the ideal split according to a split criterion is determined, and the feature that performs best among the k candidates is chosen as the split feature. The number k should not be set too high, so that the same features are not always chosen, but also not too low, so that at least some relevant features enter the comparison. It also depends highly on the dataset. If there are only a few relevant features in the dataset, the number k should be set sufficiently high, and vice versa.

The introduction of randomness in the feature selection process seems to be advantageous in many cases. The default in many packages like randomForest is to choose k as the square root of the number of features.
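The split procedure described above can be sketched in a few lines. This is only an illustrative Python sketch, not the implementation of randomForest or any other package: it assumes numeric features, class labels, and Gini impurity as the split criterion, and the function names (`gini`, `best_threshold`, `random_split`) are made up for this example.

```python
import math
import random


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_threshold(values, labels):
    """Return (weighted impurity, threshold) of the best split on one feature."""
    best = (float("inf"), None)
    # Skip the maximum value: splitting there would leave the right side empty.
    for t in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[0]:
            best = (score, t)
    return best


def random_split(X, y, k=None, rng=random):
    """Choose a split using only k randomly drawn candidate features."""
    p = len(X[0])
    if k is None:
        k = max(1, int(math.sqrt(p)))  # square-root default mentioned above
    candidates = rng.sample(range(p), k)
    best = (float("inf"), None, None)  # (impurity, feature index, threshold)
    for j in candidates:
        score, t = best_threshold([row[j] for row in X], y)
        if score < best[0]:
            best = (score, j, t)
    return best, candidates
```

With a small k, a strong feature is often missing from `candidates`, so other features get a chance to be used; with k equal to the number of features, the sketch reduces to an ordinary greedy split over all features.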

Bagging

Bagging is one of the core principles of random forests and is a highly advantageous modeling strategy. I will now describe how bagging can be done and some of its advantages and disadvantages. The underlying base models can be of any kind; in random forests, decision trees are the base models.

Training phase

In bagging, several base models are trained, and each model is trained on only a subset of the training observations. The resulting models are also called bagged models.
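The training phase can be sketched as follows. This is a minimal Python illustration under stated assumptions: the subsets are drawn as bootstrap samples (with replacement, same size as the training set), which is one common choice, and `fit_majority` is a hypothetical toy base model that stands in for the decision trees used in random forests.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]


def train_bagged_models(X, y, fit, n_models=10, seed=0):
    """Train n_models base models, each on its own bootstrap sample."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        idx = bootstrap_indices(len(X), rng)
        models.append(fit([X[i] for i in idx], [y[i] for i in idx]))
    return models


def fit_majority(X, y):
    """Toy base model (hypothetical): always predicts the most frequent label."""
    majority = max(sorted(set(y)), key=y.count)
    return lambda row: majority
```

Any function that takes training rows and labels and returns a prediction function can be passed as `fit`, so the same loop works for other base models.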

A More Complete List of Random Forest R Packages

In my last post I provided a small list of some R packages for random forest. Today I will provide a more complete list of random forest R packages. In the first table I list the R packages that can fit the standard random forest as described in the original Breiman paper.

package	RStudio downloads in the last month
randomForest	28353
xgboost	4537
randomForestSRC	2291
ranger	1347
Rborist	284

Random Forest in R

Random forests were formally introduced by Breiman in 2001. Due to their excellent performance and simple application, random forests are becoming an increasingly popular modeling strategy in many different research areas.

Random forests are suitable in many different modeling cases, such as classification, regression, survival time analysis, multivariate classification and regression, multilabel classification and quantile regression.

An overview of existing random forest implementations and their speed performance can be found in the ranger documentation, although this list is not exhaustive and many new implementations are coming up. The performance of models built with different packages differs slightly, depending on how the random forest algorithm was implemented.

Now I will present some random forest implementations in R. A good site to find all R packages on a specific topic is Metacran.

First Post

Hello out there!

This is my first blog post on this blog.

I will post here regularly about topics related to tree-based methods in machine learning.