Self-Optimizing Random Forests

25 Feb 2022 (modified: 05 May 2023) · AutoML 2022 (Late-Breaking Workshop)
Abstract: Random Forests (RFs) have become a standard choice of data scientists for classification and regression tasks over the last two decades. Several modifications of RFs have been proposed to overcome the limitation of standard RFs to axis-aligned splits, most notably Oblique Random Forests (ORFs), which allow splits to be defined over linear combinations of attributes rather than a single attribute. However, just as there is no single best learner for all supervised learning problems, there is also no single best RF type. In this paper, we present self-optimizing random forests (SORFs). SORFs incrementally build a forest by growing trees of different types and identifying the best tree type based on extrapolations of the forest performance curves obtained from out-of-bag performance estimates. Our exhaustive empirical evaluation shows that SORFs consistently achieve the maximum performance across the three considered RF types while requiring, on average, only half the time needed to train all of these forests. At the same time, SORFs outperform standard RFs in a statistically significant way, by at least 0.01 in accuracy on 25% of the considered datasets and by more than 0.35 in two cases.
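The abstract sketches the core loop of SORFs: grow trees of several candidate types, track each type's out-of-bag (OOB) performance curve as the forest grows, extrapolate those curves, and keep the best type. The following is a minimal Python sketch of that idea under stated assumptions, not the authors' implementation: the paper's oblique tree types are not available in scikit-learn, so two DecisionTreeClassifier configurations serve as hypothetical stand-in "tree types", and the extrapolation is a crude placeholder (mean of the last few curve points) rather than the curve model used in the paper.

```python
# Minimal sketch of the SORF idea (not the authors' code).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n = len(y)

# Hypothetical stand-ins for the paper's RF variants (oblique trees are not in scikit-learn).
tree_types = {
    "axis_aligned": dict(splitter="best", max_features="sqrt"),
    "random_split": dict(splitter="random", max_features="sqrt"),
}

def grow_forest(params, n_trees):
    """Grow a bagged forest incrementally and record its OOB accuracy curve."""
    votes = np.zeros((n, 2))           # accumulated OOB class votes (2 classes here)
    seen = np.zeros(n, dtype=bool)     # samples that were out-of-bag at least once
    curve = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)               # bootstrap sample
        oob = np.setdiff1d(np.arange(n), idx)          # out-of-bag indices
        tree = DecisionTreeClassifier(random_state=int(rng.integers(0, 10**9)), **params)
        tree.fit(X[idx], y[idx])
        votes[oob, tree.predict(X[oob]).astype(int)] += 1
        seen[oob] = True
        pred = votes[seen].argmax(axis=1)
        curve.append(float(np.mean(pred == y[seen]))) # OOB performance estimate
    return curve

def extrapolate(curve):
    """Placeholder performance-curve extrapolation: mean of the last few points."""
    return float(np.mean(curve[-5:]))

# Grow a small forest of each candidate type, extrapolate its curve, keep the best type.
scores = {name: extrapolate(grow_forest(p, n_trees=25)) for name, p in tree_types.items()}
best = max(scores, key=scores.get)
print(scores, "-> selected tree type:", best)
```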
Keywords: Random Forests, Oblique Splits, Out-of-Bag Performance, Performance Curve
One-sentence Summary: This paper presents a learning algorithm for random forests that dynamically decides which tree type to use based on an analysis of their performance curves and thereby consistently achieves optimal performance.
Track: Main track
Reproducibility Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Reviewers: I am already a reviewer.
CPU Hours: 1403
GPU Hours: 0
TPU Hours: 0
Evaluation Metrics: No
Steps For Environmental Footprint Reduction During Development: A database with the voting schemes of the different random trees was created for all train/test splits and all datasets. From this database, the results can be reproduced (also by other researchers) without having to train any tree again.
Estimated CO2e Footprint: 63.64
Main Paper And Supplementary Material: pdf