Improving Tabular Generative Models: Loss Functions, Benchmarks, and Iterative Objective Bayesian Approaches
Keywords: generative adversarial network, synthetic data, correlation- and distribution-aware loss function, iterative objective refinement Bayesian optimization, benchmarking framework
TL;DR: We propose a novel loss function and optimization method that significantly improve the ability of deep generative models to create high-quality synthetic tabular data for better machine learning performance.
Abstract: Access to extensive data is essential for improving model performance and generalization in deep learning (DL). When dealing with sparse datasets, a promising solution is to generate synthetic data using deep generative models (DGMs). However, these models often struggle to capture the complexities of real-world tabular data, including diverse variable types, imbalances, and intricate dependencies.
Additionally, standard Bayesian optimization (SBO), commonly used for hyperparameter tuning, struggles to aggregate metrics with different units, leading to unreliable averaging and suboptimal decisions.
To address these gaps, we introduce a novel correlation- and distribution-aware loss function that regularizes DGMs, enhancing their ability to generate synthetic tabular data that faithfully represents actual distributions. To aid in evaluating this loss function, we also propose a new multi-objective aggregation method using iterative objective refinement Bayesian optimization (IORBO) and a comprehensive statistical testing framework. While the focus of this paper is on improving the loss function, each contribution stands on its own and can be applied to other DGMs, applications, and hyperparameter optimization techniques.
We validate our approach using a benchmarking framework with twenty real-world datasets and ten established tabular DGM baselines. Results demonstrate that the proposed loss function significantly improves the fidelity of the synthetic data generated with DGMs, leading to better performance in downstream machine learning (ML) tasks. Furthermore, IORBO consistently outperformed SBO, yielding superior optimization results. This work advances synthetic data generation and optimization techniques, enabling more robust applications in DL.
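To make the idea of a correlation- and distribution-aware regularizer concrete, the sketch below shows one simple way such a term could be computed on a batch of real and synthetic samples. This is an illustrative assumption, not the paper's actual loss: the function name, the Frobenius-norm correlation penalty, the moment-matching distribution penalty, and the `alpha`/`beta` weights are all hypothetical stand-ins for whatever the authors define.

```python
import numpy as np

def correlation_distribution_loss(real, synth, alpha=1.0, beta=1.0):
    """Hypothetical auxiliary loss for a tabular DGM.

    Penalizes mismatches between a real batch and a synthetic batch
    (both shaped [n_samples, n_features]) in (a) pairwise feature
    correlations and (b) per-feature marginal statistics.
    """
    # Correlation term: Frobenius norm of the gap between the two
    # Pearson correlation matrices.
    corr_gap = np.linalg.norm(
        np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    )
    # Distribution term: mismatch in per-feature means and standard
    # deviations, a cheap proxy for matching the marginal distributions.
    dist_gap = (np.abs(real.mean(axis=0) - synth.mean(axis=0)).sum()
                + np.abs(real.std(axis=0) - synth.std(axis=0)).sum())
    return alpha * corr_gap + beta * dist_gap
```

In a GAN-style setup, a term like this would be added to the generator's adversarial loss so that gradients also push the synthetic batch toward the real data's dependency structure and marginals.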
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9763