Synthesizing Tabular Data with Latent Semantic Regularization

TMLR Paper2415 Authors

23 Mar 2024 (modified: 12 Jun 2024) · Rejected by TMLR · CC BY-SA 4.0
Abstract: Modern generative models have shown remarkable capabilities in synthesizing tabular data, yet they often fail to preserve the semantic integrity of generated samples, a failure that can be interpreted as a form of hallucination. To address this gap, we propose a novel framework that formulates the problem as constrained optimization: the implicit semantic constraints of the data are learned without supervision, and the generative model is then encouraged, through regularization, to respect the learned semantic boundaries. Our framework includes a \textit{validator} component in the form of a latent space model that is tasked with capturing the underlying semantic structures of the training data. This generic validator can be used to regularize the \textit{synthesizer} model and steer it towards improving the semantic integrity of the synthesized data. We showcase our framework with a VAE-based validator and a GAN-based synthesizer. We propose metrics designed specifically to measure the semantic integrity of the synthesized data and demonstrate that our approach not only maintains the general quality of the generated data but also ensures higher adherence to complex, domain-specific semantic relationships within the generated datasets.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Section 4 has been extended. Fixes have been applied to Section 3, specifically regarding Equation 3.1. $C(x)$ is now marked in the method section (Section 3). The confusion between $C$ as the constraint function and $C$ as the auxiliary classifier has been addressed.
Assigned Action Editor: ~Zhe_Gan1
Submission Number: 2415