Towards Synthetic Data for Fine-tuning Tabular Foundation Models

Published: 09 Jun 2025, Last Modified: 09 Jun 2025 · FMSD @ ICML 2025 · CC BY 4.0
Keywords: Synthetic Data, Fine-tuning, Tabular Foundation Models
TL;DR: Synthetic Data Generation for Fine-tuning Tabular Foundation Models
Abstract: Tabular foundation models pre-trained on synthetically generated datasets have exhibited strong in-context learning capabilities. While fine-tuning can further enhance predictive performance, overfitting to the training data of a downstream task poses a significant risk in tiny-to-small data regimes. We propose a fine-tuning method that employs synthetically generated fine-tuning data to avoid overfitting and improve generalization performance. We study three variants of data generation methods and empirically demonstrate that they mitigate overfitting and outperform standard fine-tuning approaches across five tiny-to-small real-world datasets. Our data generation methods leverage density estimators and structural causal models, akin to those employed during pre-training, to yield the best performance. Our findings indicate that synthetic data generation, a central element in pre-training, can be successfully adapted to enhance fine-tuning.
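The abstract describes generating synthetic fine-tuning data with density estimators fitted on the downstream task. As a minimal sketch of this idea (not the paper's actual method, whose generators and labeling scheme are not specified here), one simple density-based variant draws synthetic rows from a Gaussian kernel density estimate over the real training features and lets each synthetic row inherit the label of its seed row; the function name `kde_augment` and all parameters are hypothetical.

```python
import numpy as np

def kde_augment(X, y, n_synth, bandwidth=0.1, seed=0):
    """Sample synthetic rows from a Gaussian KDE fitted on real features.

    Illustrative assumption: each synthetic row is a real row plus
    per-feature-scaled Gaussian noise (equivalent to sampling from a
    Gaussian KDE centered on the training rows) and inherits the label
    of the row it was seeded from.
    """
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(X), size=n_synth)          # choose seed rows
    noise = rng.normal(0.0, bandwidth, size=(n_synth, X.shape[1]))
    X_synth = X[idx] + noise * X.std(axis=0)             # scale noise per feature
    return X_synth, y[idx]

# Tiny-data demo: 20 real rows with 4 features, augmented to 100 synthetic rows.
X = np.random.default_rng(1).normal(size=(20, 4))
y = (X[:, 0] > 0).astype(int)
X_synth, y_synth = kde_augment(X, y, n_synth=100)
print(X_synth.shape, y_synth.shape)  # (100, 4) (100,)
```

Fine-tuning on such sampled rows, rather than repeatedly on the handful of real rows, is one way to reduce overfitting in the tiny-to-small data regimes the paper targets.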
Submission Number: 106