Keywords: synthetic data generation, diffusion model, generative model, tabular data, mixed-type data
TL;DR: We propose a continuous-time diffusion model for mixed-type tabular data and show that accounting for feature heterogeneity in the design of noise schedules increases sample quality.
Abstract: Score-based generative models, or diffusion models, have proven successful
at generating text and images across many domains. However, mixed-type tabular
data has so far received little attention within this model family. Existing
research mainly combines continuous and categorical diffusion processes and does
not explicitly account for the feature heterogeneity inherent to tabular data. In this
paper, we combine score matching and score interpolation so that a single
type of continuous noise distribution is applied to both continuous and categorical
features. Further, we investigate the impact of distinct noise schedules per feature or
per data type. We allow for adaptive, learnable noise schedules to ensure optimally
allocated model capacity and balanced generative capability. Results show that
our model outperforms the benchmark models consistently and that accounting for
heterogeneity within the noise schedule design boosts sample quality.
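
As a rough illustration of the idea described above, the following sketch applies Gaussian noise with a separate noise level per feature to both continuous features and embedded categorical features. It is not the authors' implementation: the geometric schedule, the function names `noise_levels` and `add_noise`, the embedding dimension, and all numeric values are assumptions made for illustration, and the learnable schedules mentioned in the abstract are replaced here by fixed per-feature bounds.

```python
import numpy as np


def noise_levels(t, sigma_min, sigma_max):
    """Hypothetical per-feature schedule (an assumption, not from the paper).

    Geometric interpolation between sigma_min[i] at t=0 and sigma_max[i] at t=1.
    t: (n,) times in [0, 1]; sigma_min, sigma_max: (d,) per-feature bounds.
    Returns an (n, d) array of noise standard deviations.
    """
    t = np.asarray(t, dtype=float)[:, None]  # (n, 1), broadcasts against (d,)
    return sigma_min ** (1.0 - t) * sigma_max ** t


def add_noise(x_cont, cat_embeddings, t, sigma_min, sigma_max, rng):
    """Add Gaussian noise to continuous and embedded categorical features.

    x_cont: (n, d_cont) continuous features.
    cat_embeddings: (n, d_cat, k) one k-dimensional embedding per categorical
        feature (assumed to come from some embedding table).
    The same kind of noise is used for both data types; only the scale
    differs from feature to feature.
    """
    d_cont = x_cont.shape[1]
    sigma = noise_levels(t, sigma_min, sigma_max)  # (n, d_cont + d_cat)
    x_noisy = x_cont + sigma[:, :d_cont] * rng.standard_normal(x_cont.shape)
    e_noisy = cat_embeddings + sigma[:, d_cont:, None] * rng.standard_normal(
        cat_embeddings.shape
    )
    return x_noisy, e_noisy, sigma


# Toy usage: 4 rows, 2 continuous features, 1 categorical feature embedded in 3 dims.
rng = np.random.default_rng(0)
x_cont = rng.standard_normal((4, 2))
cat_embeddings = rng.standard_normal((4, 1, 3))
t = rng.uniform(size=4)
sigma_min = np.array([0.01, 0.01, 0.01])  # illustrative values only
sigma_max = np.array([1.0, 5.0, 10.0])    # a different maximum per feature
x_noisy, e_noisy, sigma = add_noise(x_cont, cat_embeddings, t, sigma_min, sigma_max, rng)
print(sigma.shape, x_noisy.shape, e_noisy.shape)
```

Sharing one schedule per data type instead of per feature would correspond, in this sketch, to repeating the same bounds across all features of that type.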
Submission Number: 57