Explicit Column Relationship-Based Diffusion Model for High-Quality Synthetic Tabular Data Generation
Keywords: synthetic tabular data generation, constraint-guided diffusion model, inter-column dependencies, real-world constraints
Abstract: Tabular data plays a vital role in critical applications such as healthcare, finance, and education, but its effective use in data-driven models is frequently hindered by data scarcity and privacy concerns. In response, synthetic tabular data generation has emerged as a powerful solution that provides privacy-preserving data mirroring real-world distributions. However, many existing generative models still struggle to preserve the complex column relationships within tabular data. They also often fail to account for the real-world constraints that are essential to the authenticity and practical usability of the generated data. In this paper, we propose ECR-DM, the Explicit Column Relationship-Based Diffusion Model for synthetic tabular data generation. In the forward diffusion process, we introduce the Noise Perturbation Mechanism (NPM), which enables the model to learn column distributions in a fine-grained manner. In the reverse diffusion process, we incorporate Constraint-Guided Recovery (CGR), which guides the model to recover inter-column dependencies and restore the true data distribution. NPM helps the diffusion model capture detailed column-wise characteristics of the data, while CGR preserves inter-column relationships and thereby supports high-quality synthetic tabular data generation. We validate the effectiveness of our approach through extensive experiments on six tabular data benchmarks. Our model outperforms state-of-the-art methods across seven evaluation metrics, particularly in downstream tasks. Code is available at https://anonymous.4open.science/r/ECR-DM-0C72.
Primary Area: generative models
Submission Number: 8285