Do Tree-based Models Need Data Preprocessing?

Published: 12 Jul 2024, Last Modified: 09 Aug 2024
Venue: AutoML 2024 Workshop
License: CC BY 4.0
Keywords: data preprocessing, tree-based models, machine learning, automated machine learning
TL;DR: In this work we evaluate the impact of preprocessing strategies on tree-based models and discuss the most beneficial ones.
Abstract: The number of machine learning (ML) algorithms and performance-improving ML methodologies grows every year. This abundance of options makes it impossible for data scientists to test all of them for every problem, which creates a need for studies that evaluate best practices. In this paper, we evaluate the impact of preprocessing strategies on tree-based models. To conduct this study, we prepare 38 different preprocessing strategies and train almost one million tree-based models. We then analyze the impact of the different data preparation strategies and identify the best-performing ones using a newly introduced preprocessibility measure.
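To illustrate the kind of comparison the abstract describes, the sketch below contrasts a tree-based model trained with and without a single preprocessing step. This is not the authors' experimental code: the dataset, the random forest model, and the choice of standard scaling as the preprocessing strategy are illustrative assumptions.

```python
# Minimal sketch: comparing preprocessing strategies for a tree-based model.
# Dataset, model, and the single preprocessing step are assumptions for
# illustration only; the paper evaluates 38 strategies across many models.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

strategies = {
    "no_preprocessing": RandomForestClassifier(random_state=0),
    "standard_scaling": make_pipeline(
        StandardScaler(), RandomForestClassifier(random_state=0)
    ),
}

# Cross-validated score per strategy; differences (or their absence) indicate
# how sensitive the tree-based model is to this preprocessing choice.
for name, model in strategies.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.4f}")
```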
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Optional Meta-Data For Green-AutoML: All questions below on environmental impact are optional.
CPU Hours: 1300
GPU Hours: 0
TPU Hours: 0
Evaluation Metrics: No
Submission Number: 13