Continuous Diffusion for Mixed-Type Tabular Data

Published: 30 Oct 2023, Last Modified: 30 Nov 2023, SyntheticData4ML 2023 Poster
Keywords: synthetic data generation, diffusion model, generative model, tabular data, mixed-type data
TL;DR: We propose a continuous-time diffusion model for mixed-type tabular data and show that accounting for feature heterogeneity in the design of noise schedules increases sample quality.
Abstract: Score-based generative models, commonly called diffusion models, have proven successful in generating text and images across many domains. However, their application to mixed-type tabular data remains limited so far. Existing research mainly combines continuous and categorical diffusion processes and does not explicitly account for the feature heterogeneity inherent to tabular data. In this paper, we combine score matching and score interpolation so that a common type of continuous noise distribution affects both continuous and categorical features. Further, we investigate the impact of distinct noise schedules per feature or per data type. We allow for adaptive, learnable noise schedules to ensure optimally allocated model capacity and balanced generative capability. Results show that our model consistently outperforms the benchmark models and that accounting for heterogeneity within the noise schedule design boosts sample quality.
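The idea of applying one kind of continuous (Gaussian) noise to every feature, with a separate noise schedule per feature, can be illustrated with a minimal sketch. It is not the authors' implementation. The geometric schedule `sigma`, the toy embeddings, and all function names are assumptions made for illustration, and the fixed schedule stands in for the learnable schedules described in the abstract. The function `interpolate_embedding` reflects the general idea of score interpolation as commonly described: a categorical feature is embedded in continuous space, and the denoised estimate is the average of the class embeddings weighted by predicted class probabilities.

```python
import math
import random


def sigma(t, sigma_min, sigma_max):
    """Hypothetical per-feature schedule: geometric interpolation for t in [0, 1]."""
    return sigma_min * (sigma_max / sigma_min) ** t


def add_noise(row, t, schedules, rng):
    """Add Gaussian noise to each feature vector at that feature's own noise level."""
    noisy = []
    for vec, (s_min, s_max) in zip(row, schedules):
        s = sigma(t, s_min, s_max)
        noisy.append([v + s * rng.gauss(0.0, 1.0) for v in vec])
    return noisy


def interpolate_embedding(probs, embeddings):
    """Expected embedding under predicted class probabilities (assumed form)."""
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]


# Toy row: one continuous feature and one 3-class categorical feature
# represented by a made-up 2-dimensional embedding of its observed class.
embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
row = [[0.5], embeddings[1]]

# One (sigma_min, sigma_max) pair per feature; the values are arbitrary.
schedules = [(0.01, 1.0), (0.01, 10.0)]

rng = random.Random(0)
noisy_row = add_noise(row, 0.5, schedules, rng)

# Pretend a model predicted these class probabilities for the noisy categorical feature.
x_hat = interpolate_embedding([0.2, 0.7, 0.1], embeddings)
```

Because both feature types end up as real-valued vectors perturbed by the same kind of noise, a single denoising model can treat them uniformly, while the per-feature `schedules` let each feature be corrupted at its own rate.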
Submission Number: 57