Differentially Private Synthetic Data Generation with Diversity via APIs

ICLR 2026 Conference Submission 730 Authors

02 Sept 2025 (modified: 23 Dec 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Differential privacy, synthetic data generation, foundation models
TL;DR: Differentially private synthetic data generation with black-box access to foundation models
Abstract: Synthetic data has emerged as a key solution for preserving the privacy of original data in fields that handle sensitive information, such as healthcare and finance. Recent advances in foundation models have significantly improved the quality of synthetic data. However, most high-performance foundation models are available only as black-box APIs, which limits fine-tuning and requires private data containing sensitive information to be transmitted to external servers. To address this issue, Private Evolution (PE) was introduced as a privacy-preserving synthetic data generation method that leverages genetic algorithms with black-box foundation models. Nevertheless, because of its evolutionary process, PE tends to focus repeatedly on a limited subset of samples, which significantly reduces the diversity of the generated synthetic dataset. Since diversity is crucial for the utility of synthetic data and for robustness across scenarios, we propose Div-PE, an improved approach that overcomes the diversity limitations of PE through a sample-variant two-stage voting mechanism. This method enhances data diversity and yields a 17.2% improvement in FID and an 11.0% increase in downstream accuracy with ResNet-18, averaged over ImageNet, Camelyon17, and UTKFace. Furthermore, Div-PE delivers strong experimental results not only on image data but also on other modalities, including tabular and text data, supporting its applicability to a wide range of data types.
Primary Area: generative models
Submission Number: 730