Unmasking Trees for Tabular Data

Calvin McCarter

Unmasking Trees for Tabular Data

Calvin McCarter

Published: 10 Oct 2024, Last Modified: 20 Oct 2024TRL @ NeurIPS 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: tabular, imputation, generation, trees

TL;DR: We show that tree-based autoregression is a strong, simple baseline for tabular data, with state-of-the-art results for data with missingness.

Abstract: Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation, and on generation given training data with missingness; and it has competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.

Submission Number: 2

Loading