Keywords: Foundation Model, Tabular Data, Synthetic Data Generation
Abstract: Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, applying them to tabular data is challenging because table features are heterogeneous. Current cross-table learning frameworks lack both a generative model backbone and a mechanism for decoding heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based foundation model for tabular data generation. CTSyn has two key components: an autoencoder network that consolidates diverse tables into a unified latent space and dynamically reconstructs table values based on the provided table schema embedding, which lets it adapt to heterogeneous datasets; and a conditional latent diffusion model that samples from this learned latent space. Through large-scale pre-training, CTSyn not only outperforms existing table synthesizers on standard tabular data generation benchmarks in terms of utility and diversity, but also uniquely enhances the performance of downstream machine learning tasks, surpassing what is achievable with real data. This establishes CTSyn as a new paradigm for synthetic table generation.
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 12491