Abstract: Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) that incrementally unmasks individual features using gradient-boosted decision trees. This approach offers leading performance on imputation and state-of-the-art performance on generation given training data with missingness, along with competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumptions about the conditional distribution, so it accommodates features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. Finally, we consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.
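To make the two mechanisms in the abstract concrete, here is a minimal sketch of a BaltoBot-style balanced tree of boosted classifiers and of UnmaskingTrees-style incremental unmasking. This is our own illustration, not the authors' implementation: it assumes scikit-learn's `HistGradientBoosting` models (which natively accept NaN inputs), all names are our illustrative choices, and it elides details such as discrete features and the handling of degenerate splits.

```python
# Hedged sketch (not the authors' code) of the two ideas in the abstract,
# assuming scikit-learn; all class and function names are illustrative.
import numpy as np
from sklearn.ensemble import HistGradientBoostingClassifier

class BalancedTreeSampler:
    """BaltoBot-style conditional sampler for a single feature: recursively
    split the target at its median and fit one boosted classifier per split.
    For brevity, assumes each split sees both classes (non-constant y)."""
    def __init__(self, depth=3):
        self.depth = depth

    def fit(self, X, y):
        self.split = np.median(y)
        left = y <= self.split
        # HistGradientBoosting* natively tolerates NaN entries in X.
        self.clf = HistGradientBoostingClassifier().fit(X, left)
        if self.depth > 1 and left.any() and (~left).any():
            self.children = (
                BalancedTreeSampler(self.depth - 1).fit(X[left], y[left]),
                BalancedTreeSampler(self.depth - 1).fit(X[~left], y[~left]),
            )
        else:
            self.children = None
            self.leaves = (y[left], y[~left])  # empirical values per leaf bin
        return self

    def sample(self, x, rng):
        proba = self.clf.predict_proba(x[None, :])[0]
        go_left = rng.random() < proba[list(self.clf.classes_).index(True)]
        if self.children is not None:
            return self.children[0 if go_left else 1].sample(x, rng)
        vals = self.leaves[0 if go_left else 1]
        return rng.choice(vals) if len(vals) else self.split

def fit_samplers(X):
    """One conditional sampler per feature, trained on rows where that
    feature is observed, with the feature itself removed from the inputs."""
    return {j: BalancedTreeSampler().fit(
                np.delete(X[~np.isnan(X[:, j])], j, axis=1),
                X[~np.isnan(X[:, j]), j])
            for j in range(X.shape[1])}

def impute(x, samplers, rng):
    """UnmaskingTrees-style imputation: unmask missing features one at a
    time in random order, conditioning each draw on every feature that is
    observed or already unmasked (still-masked features stay NaN)."""
    x = x.copy()
    for j in rng.permutation(np.flatnonzero(np.isnan(x))):
        x[j] = samplers[j].sample(np.delete(x, j), rng)
    return x
```

Note how this sketch reflects the properties claimed in the abstract: sampling is a single root-to-leaf walk (fast), and the probability of any leaf bin is simply the product of the classifier probabilities along its path, which is what makes closed-form density estimation possible.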
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:

### Revision 2025-06-01
We have updated the manuscript in response to the reviewers' helpful feedback. Two changes stand out, as they address concerns around TMLR's criteria for claims and evidence.
First, we have clarified and "toned down" our empirical claims. We previously claimed SotA performance on imputation, SotA performance on generation-with-missingness, and competitive performance on generation-without-missingness. Now we state the following in the Introduction (with similar remarks elsewhere):
"On this benchmark for "out-of-the-box" performance on small tabular datasets, our approach offers leading performance on imputation and state-of-the-art performance on generation given data with missingness; and it has competitive performance on vanilla generation without missingness."
Second, we added the following to the Limitations section (with similar remarks elsewhere) to emphasize that we are focused on small datasets and on users with limited computing resources:
"Related to the above remarks, we would like to emphasize that our proposed approach is aimed at and evaluated on smaller-sized tabular datasets. It is also evaluated via “out-of-the-box” performance, being aimed at users lacking the resources for large deep learning models or hyperparameter optimization. For users with access to larger tabular datasets and more extensive computing resources, recent deep learning methods like Tabsyn (Zhang et al., 2024) would be expected to perform better."
### Revision 2025-06-05
Added explanations motivating conditional distribution modeling for autoregression and clarifying the selection of the BaltoBot meta-tree height.
Assigned Action Editor: ~Jes_Frellsen1
Submission Number: 4359