Abstract: Generating realistic, safe, and useful tabular data is important for downstream tasks such as (privacy-preserving) imputation, oversampling, explainability, and simulation. However, the structure of tabular data, marked by heterogeneous types, non-smooth distributions, complex feature dependencies, and categorical imbalance, poses significant challenges. Although many generative approaches have been proposed, a fair and unified evaluation across datasets remains missing. This work benchmarks five recent model families on 16 diverse datasets (averaging 80K rows), with careful optimization of hyperparameters, feature encodings, and architectures. We show that dataset-specific tuning leads to substantial performance gains, particularly for diffusion-based models. We further introduce constrained hyperparameter spaces that retain competitive performance while significantly reducing tuning cost, enabling …