Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Published: 26 Sept 2023, Last Modified: 11 Jan 2024NeurIPS 2023 Datasets and Benchmarks PosterEveryoneRevisionsBibTeX
Keywords: synthetic data generation, data-centric AI, tabular data, synthetic data evaluation
TL;DR: This paper proposes integrating data-centric AI into synthetic data generation, offering a new framework and practical recommendations based on an evaluation of state-of-the-art tabular data generation models.
Abstract: Synthetic data serves as an alternative in training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques which profile the data to guide the synthetic data generation process. Moreover, we shed light on the often ignored consequences of neglecting these data profiles during synthetic data generation --- despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
Submission Number: 351
Loading