Keeping it Simple – Computational Resources in Deep Generative versus Traditional Methods for Synthetic Tabular Data Generation in Healthcare

NLDL 2025 Conference Submission 21 Authors

04 Sept 2024 (modified: 14 Nov 2024) · Submitted to NLDL 2025 · CC BY 4.0
Keywords: synthetic data, healthcare data, deep generative models, traditional statistical generative methods, computational resources, benchmarking
TL;DR: The experiment demonstrates the asymmetry in computational resource needs between deep generative methods and traditional methods for synthetic data generation, and emphasizes sustainability in the choice of method.
Abstract: Synthetic data has emerged as a solution to data access challenges in healthcare, particularly for accelerating AI tool development. Deep generative methods, including generative adversarial networks, variational autoencoders, and diffusion models, have gained prominence for creating realistic and representative synthetic datasets with low re-identification risk. However, although the sustainability of future computational demands is a growing concern, computational needs are often overlooked when benchmarking solutions for tabular data in healthcare. This study compares traditional and deep generative methods in terms of the computational resources they require, relative to differences in statistical similarity between the training dataset and the synthetic dataset. The findings reveal that while quality performance in this experiment is comparable, the deep generative methods consume significantly more resources, necessitating High Performance Computing infrastructure. We hope researchers will increasingly report computational resources as a parameter when benchmarking methods, building a broader body of literature to guide the choice of method.
Submission Number: 21