TabRep: Training Tabular Diffusion Models with a Simple and Effective Continuous Representation

TMLR Paper5537 Authors

02 Aug 2025 (modified: 21 Dec 2025) | Decision pending for TMLR | CC BY 4.0
Abstract: Diffusion models have been the predominant generative models for tabular data generation. However, they face the conundrum of modeling under a separate versus a unified data representation. The former must jointly model all of the multi-modal distributions of tabular data in one model. The latter alleviates this by learning a single representation for all features, but it currently relies on sparse, suboptimal encoding heuristics and incurs additional computation costs. In this work, we address the latter by presenting TabRep, a tabular diffusion architecture trained with a unified continuous representation. To motivate the design of our representation, we provide geometric insights into how the data manifold affects diffusion models. The key attributes of our representation are its density, its flexibility to provide ample separability for nominal features, and its ability to preserve intrinsic relationships. Ultimately, TabRep provides a simple yet effective approach for training tabular diffusion models under a continuous data manifold. Our results show that TabRep achieves superior performance across a broad suite of evaluations. It is the first to synthesize tabular data that exceeds the downstream quality of the original datasets while preserving privacy and remaining computationally efficient.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have incorporated the AE's comments, including:
- Notation fix: Remove the redundant superscript (i) from Equations 11-12, Section 4.2, and possibly elsewhere. As the AE understands the text, cardinality is a column property and not row-varying.
- Optional improvement: Consider moving the loss stability analysis (Appendix D.5, Figure 7) to the main text to strengthen the theory-practice connection.
- Clarify ordering: Explicitly state in Section "Ordering of Categorical Feature" (pp. 9-10) that lexicographic ordering is a heuristic baseline, not optimized for semantic structure.
Code: https://github.com/jacobyhsi/TabRep
Assigned Action Editor: ~Jan-Willem_van_de_Meent1
Submission Number: 5537