PCEvolve: Private Contrastive Evolution for Synthetic Dataset Generation via Few-Shot Private Data and Generative APIs

Published: 01 May 2025, Last Modified: 18 Jun 2025 | ICML 2025 spotlight poster | Readers: Everyone | License: CC BY-NC-ND 4.0
Abstract: The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP’s utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCEvolve outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at https://github.com/TsingZ0/PCEvolve.
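For readers unfamiliar with the Exponential Mechanism mentioned in the abstract, the following is a minimal sketch of the textbook (unadapted) mechanism only. It is not PCEvolve's adapted variant, which is in the linked repository. The function name `exponential_mechanism`, the example scores, and the parameter values are all hypothetical and chosen for illustration.

```python
import numpy as np

def exponential_mechanism(scores, epsilon, sensitivity, rng):
    """Textbook exponential mechanism (illustrative, not the paper's adapted version).

    Samples one candidate index with probability proportional to
    exp(epsilon * score / (2 * sensitivity)).
    """
    scores = np.asarray(scores, dtype=float)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()          # subtract the max for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()            # normalize into a probability distribution
    index = int(rng.choice(len(scores), p=probs))
    return index, probs

# Hypothetical utility scores for three synthetic candidates.
rng = np.random.default_rng(0)
scores = [0.1, 0.5, 0.9]
index, probs = exponential_mechanism(scores, epsilon=1.0, sensitivity=1.0, rng=rng)
print(index, probs)
```

Higher-scoring candidates are selected with higher probability, and a larger `epsilon` (weaker privacy) concentrates the distribution more strongly on the best candidate.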
Lay Summary: (1) **Problem**: Creating realistic synthetic data for specialized fields like healthcare is crucial for training AI models, but privacy concerns and limited data access pose major challenges. Existing methods struggle when only a few private examples are available, as adding privacy protections often ruins data quality. (2) **Solution**: We developed a new algorithm, PCEvolve, that generates high-quality synthetic images while protecting privacy—even with just a handful of examples. By focusing on key differences between data categories (e.g., tumor vs. healthy tissue) and using smarter selection strategies, PCEvolve guides AI tools to produce synthetic images that closely match the private data without directly exposing it. (3) **Impact**: PCEvolve outperforms existing methods across medical and industrial datasets, enabling clinics or factories with limited data to safely leverage powerful AI tools. This breakthrough makes privacy-preserving synthetic data practical for critical applications, helping democratize AI access while safeguarding sensitive information. Our open-source tool allows researchers to explore this approach further.
Link To Code: https://github.com/TsingZ0/PCEvolve
Primary Area: Social Aspects->Privacy
Keywords: synthetic data generation, differential privacy, evolution algorithm
Submission Number: 10653