Keywords: differential privacy, tabular data, synthetic data
TL;DR: We adapt an evolutionary algorithm (Private Evolution) to generate differentially private synthetic tabular data.
Abstract: Tabular data is one of the most widely used formats in practice, yet much of it remains inaccessible due to privacy concerns. Synthetic data generation with formal privacy guarantees, i.e., differential privacy (DP), offers a promising way to enable data sharing while protecting sensitive information. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we adapt the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE, an algorithm for generating synthetic tabular data under DP. Tab-PE refines a synthetic dataset through an evolutionary process that uses APIs to generate variations of the data, privately evaluate them, and retain the highest-quality samples. Whereas the original PE requires access to large foundation models, Tab-PE relies on computationally efficient heuristic APIs specialized for tabular data. Through extensive experiments on real-world and simulated datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline, AIM, Tab-PE improves classification accuracy by up to 10\% while running 28$\times$ faster.
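The evolutionary loop described in the abstract (generate variations, privately evaluate them, retain the best samples) can be illustrated with a minimal sketch. This is not the Tab-PE implementation: the function names, the Hamming distance, the single-column mutation used as a "variation API", and the noise scale `sigma` are all assumptions chosen for illustration, and the sketch covers categorical columns only. Each private record casts one vote for its nearest candidate, so adding or removing one record changes a single count by 1, and Gaussian noise is added to the vote histogram before selection. The sketch does not calibrate `sigma` to a target privacy budget or account for composition across iterations, and Python's `random` module is not a suitable noise source for a real DP deployment.

```python
import random


def hamming(a, b):
    """Number of columns in which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b))


def mutate(record, domains, rng):
    """Hypothetical variation API: resample one randomly chosen column."""
    j = rng.randrange(len(record))
    child = list(record)
    child[j] = rng.choice(domains[j])
    return tuple(child)


def noisy_votes(private, candidates, sigma, rng):
    """Each private record votes for its nearest candidate; add Gaussian noise.

    One record affects exactly one count by 1, so the histogram has
    L2 sensitivity 1 under add/remove-one-record neighbors.
    """
    votes = [0.0] * len(candidates)
    for rec in private:
        nearest = min(range(len(candidates)),
                      key=lambda i: hamming(rec, candidates[i]))
        votes[nearest] += 1.0
    # Clip negative noisy counts to zero (post-processing, no privacy cost).
    return [max(v + rng.gauss(0.0, sigma), 0.0) for v in votes]


def private_evolution(private, domains, n_synth, n_iters, sigma, seed=0):
    """Illustrative PE-style loop; parameters are assumptions, not Tab-PE's."""
    rng = random.Random(seed)
    # Start from records drawn uniformly from the public column domains.
    synthetic = [tuple(rng.choice(d) for d in domains) for _ in range(n_synth)]
    for _ in range(n_iters):
        # 1. Generate variations of the current synthetic records.
        candidates = synthetic + [mutate(s, domains, rng) for s in synthetic]
        # 2. Privately evaluate candidates with a noisy vote histogram.
        weights = noisy_votes(private, candidates, sigma, rng)
        if sum(weights) <= 0.0:
            weights = [1.0] * len(candidates)  # fall back to uniform
        # 3. Retain candidates in proportion to their noisy votes.
        synthetic = rng.choices(candidates, weights=weights, k=n_synth)
    return synthetic
```

For example, `private_evolution(private, domains, n_synth=100, n_iters=10, sigma=1.0)` takes a list of tuples `private` and a list of per-column value lists `domains`, and returns 100 synthetic records. Only the noisy histogram touches the private data, which is the property that lets PE-style methods treat generation and variation as black-box APIs.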
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21749