Scaling Up Diffusion and Flow-based XGBoost Models

Published: 17 Jun 2024, Last Modified: 17 Jul 2024 · ICML 2024 AI4Science Poster · CC BY 4.0
Keywords: Original Research Track, Generative Models, Tabular Data, Diffusion, Flow-matching, Physics, Engineering
TL;DR: We scale up a recent proposal for using XGBoost as the function approximator in diffusion and flow-matching models on tabular data, presenting results on particle physics datasets 370x larger than previously tested.
Abstract: Novel machine learning methods for tabular data generation are often developed on small datasets which do not match the scale required for scientific applications. We investigate a recent proposal to use XGBoost as the function approximator in diffusion and flow-matching models for tabular data, whose existing implementation proved to be extremely memory-intensive even on tiny datasets. In this work, we conduct a critical analysis of that implementation from an engineering perspective, and show that its limitations are not fundamental to the method; with a better implementation it can be scaled to datasets 370x larger than previously used. We also propose algorithmic improvements that can further benefit resource usage and model performance, including multi-output trees, which are well-suited to generative modeling. Finally, we present results on large-scale scientific datasets derived from experimental particle physics as part of the Fast Calorimeter Simulation Challenge.
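
To make the core idea concrete, below is a minimal sketch of flow matching with XGBoost as the function approximator, using the multi-output trees mentioned in the abstract (`multi_strategy="multi_output_tree"` is available in XGBoost >= 2.0 with `tree_method="hist"`). The synthetic data, hyperparameters, and the choice of a single model conditioned on the time variable are illustrative assumptions, not the authors' implementation, which may differ (e.g. by fitting separate models per time step).

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)

# Hypothetical tabular dataset: n samples, d features (stand-in for real data).
n, d = 10_000, 8
x1 = rng.normal(size=(n, d))

# Conditional flow matching with a linear interpolation path:
# x_t = (1 - t) * x0 + t * x1, with velocity target u = x1 - x0.
x0 = rng.normal(size=(n, d))   # noise samples
t = rng.uniform(size=(n, 1))   # per-sample time in [0, 1]
xt = (1.0 - t) * x0 + t * x1
u = x1 - x0

# Multi-output trees: a single tree predicts all d velocity components,
# rather than training d independent single-output ensembles.
model = xgb.XGBRegressor(
    tree_method="hist",
    multi_strategy="multi_output_tree",
    n_estimators=200,
    max_depth=6,
)
model.fit(np.hstack([xt, t]), u)

# Sampling: integrate dx/dt = v(x, t) from noise with Euler steps.
steps = 50
x = rng.normal(size=(512, d))
for k in range(steps):
    tk = np.full((x.shape[0], 1), k / steps)
    x = x + model.predict(np.hstack([x, tk])) / steps
```

The vector-leaf strategy is what makes multi-output trees attractive here: a generative model must predict one output per feature, so sharing tree structure across outputs can reduce both memory and training cost relative to per-feature ensembles.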
Submission Number: 2