Improving Tabular Generative Models: Loss Functions, Benchmarks, and Improved Multi-objective Bayesian Optimization Approaches
Abstract: Access to extensive data is essential to improve model performance and generalization in deep learning (DL). When dealing with sparse datasets (those with limited samples relative to model complexity), a promising solution is to generate synthetic data using deep generative models (DGMs). However, these models often struggle to capture the complexities of real-world tabular data, including diverse variable types, imbalances, and intricate dependencies. Additionally, standard Bayesian optimization (SBO), commonly used for hyperparameter tuning, struggles to optimize over aggregated metrics with different units, leading to unreliable averaging and suboptimal decisions. To address these gaps, we introduce a novel correlation- and distribution-aware loss function that regularizes DGMs, enhancing their ability to generate synthetic tabular data that faithfully represents the underlying data distributions. We provide theoretical guarantees for the proposed loss functions, including stability and consistency analyses, to support their robustness. To enable principled hyperparameter search via Bayesian optimization (BO), we also propose a new multi-objective aggregation strategy, iterative objective refinement Bayesian optimization (IORBO), along with a comprehensive statistical testing framework. We validate the proposed approach using a benchmarking framework with twenty real-world datasets and ten established tabular DGM baselines. The results demonstrate that the proposed loss function significantly improves the fidelity of the synthetic data generated with DGMs, leading to better performance in downstream machine learning (ML) tasks. Furthermore, IORBO consistently outperformed SBO, yielding better hyperparameter configurations. This work advances synthetic data generation and optimization techniques, enabling more robust DL applications.
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Sinead_Williamson1
Submission Number: 5019