Using maximal information auxiliary variables to improve synthetic data generation based on TabPFN foundation models: preliminary results
Keywords: TabPFN, in-context learning, synthetic data generation
TL;DR: Synthetic data generation with TabPFN struggles with weakly associated variables due to low-signal contexts. MIAV adds rank-matched noise as auxiliaries, boosting context and yielding better synthetic data, faster generation, and order invariance.
Abstract: Synthetic data generation for tabular datasets is shifting toward the use of large, general-purpose foundation models. TabPFN, a state-of-the-art example, uses in-context learning to generate probabilistic predictions conditioned on observed examples in a single forward pass. However, when variables are only weakly associated with others, the model's ability to generate realistic synthetic data deteriorates, as the context examples provide little predictive signal. To address this, we introduce the maximal information auxiliary variable (MIAV) strategy, which increases context information with auxiliary variables constructed by rank-matching random noise variables to real data. We establish theoretical properties of the approach that explain its good performance for weakly associated variables. Additional practical advantages of the MIAV approach include improved computational efficiency and invariance to variable order during the synthetic data generation process. Empirical evaluations on simulated and real datasets illustrate how the MIAV strategy improves data generation compared to direct application of TabPFN.
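The abstract describes auxiliary variables built by rank-matching random noise to real data. As a rough illustration of that construction (the paper's actual procedure may differ; `miav_auxiliary` is a hypothetical helper, not from the submission), one way to rank-match noise is to draw standard normal values and place the sorted noise at the rank positions of the observed column, yielding an auxiliary variable with a Gaussian marginal that is perfectly rank-correlated with the original:

```python
import numpy as np

def miav_auxiliary(x, rng):
    """Illustrative sketch: rank-match standard normal noise to x.

    The returned auxiliary variable has a Gaussian marginal but the
    same ordering (ranks) as x, so it carries maximal rank information
    about x. This is an interpretation of the construction named in the
    abstract, not the authors' implementation.
    """
    z = rng.standard_normal(len(x))
    aux = np.empty_like(z)
    # assign the i-th smallest noise value to the index of the
    # i-th smallest observation of x
    aux[np.argsort(x)] = np.sort(z)
    return aux

rng = np.random.default_rng(0)
x = rng.exponential(size=100)          # skewed, non-Gaussian column
aux = miav_auxiliary(x, rng)           # Gaussian column with identical ranks
```

Because `aux` preserves the ordering of `x` exactly, its Spearman correlation with `x` is 1, which is one sense in which the auxiliary carries "maximal information" about the original column.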
Submission Number: 113