Feature Encodings for Gradient Boosting with Automunge

Nicholas Teague

Published: 20 Oct 2022, Last Modified: 22 Jun 2025, HITY Workshop NeurIPS 2022, Readers: Everyone

Keywords: preprocessing, tabular

TL;DR: Validation of the default encodings in the Automunge library; notably, binarization outperformed one-hot encoding.

Abstract: Automunge is a tabular preprocessing library that encodes dataframes for supervised learning. When selecting a default feature encoding strategy for gradient boosted learning, one may consider the training duration and the predictive performance associated with each feature representation. Automunge offers a default of binarization for categoric features and z-score normalization for numeric features. The presented study sought to validate those defaults by benchmarking encoding variations across a series of diverse data sets with tuned gradient boosted learning. We found that on average our chosen defaults were top performers from both a tuning duration and a model performance standpoint. Another key finding was that one-hot encoding did not perform well enough, relative to categoric binarization, to be suitable as a categoric default. We present here these and further benchmarks.
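
For readers unfamiliar with the encodings compared in the abstract, the following is a minimal illustrative sketch in plain Python. It is not Automunge's implementation or API: the function names `binarize`, `one_hot`, and `z_score` are hypothetical, and details such as category ordering and missing-value handling are assumptions made only for this example. It shows why binarization yields fewer columns than one-hot encoding for the same categoric feature.

```python
from statistics import mean, pstdev

def binarize(values):
    """Encode each category as a fixed-width binary code (hypothetical helper)."""
    categories = sorted(set(values))  # assumed ordering: alphabetical
    # Number of bits needed to give each category a distinct code.
    width = max(1, (len(categories) - 1).bit_length())
    index = {cat: i for i, cat in enumerate(categories)}
    return [[(index[v] >> bit) & 1 for bit in reversed(range(width))]
            for v in values]

def one_hot(values):
    """Encode each category as its own 0/1 indicator column (hypothetical helper)."""
    categories = sorted(set(values))
    return [[int(v == cat) for cat in categories] for v in values]

def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: avoid division by zero
    return [(v - mu) / sigma for v in values]

colors = ["red", "green", "blue", "green", "yellow", "purple"]
binary_cols = binarize(colors)   # 5 distinct categories fit in 3 binary columns
onehot_cols = one_hot(colors)    # 5 distinct categories need 5 columns
scaled = z_score([1.0, 2.0, 3.0, 4.0])

print(len(binary_cols[0]), len(onehot_cols[0]))
```

With n distinct categories, this sketch produces roughly log2(n) columns under binarization versus n columns under one-hot encoding, which is one plausible reason the two representations could differ in tuning duration.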

Supplementary Material: zip
