Abstract: Substantial quantity and high quality are the golden rules for building a good training dataset, with sample privacy protection being equally important. Generating synthetic samples that resemble high-quality private data while ensuring Differential Privacy (DP), a formal privacy guarantee, promises scalability and practicality. However, existing methods that rely on pre-trained models for data synthesis often struggle in data-deficient scenarios, suffering from limited sample sizes, inevitable generation noise, and inherent pre-trained model bias. To address these challenges, we propose a novel contr**A**stive private data **S**ynthesis via **W**eighted multiple **P**re-trained generative models framework, named **WASP**. WASP utilizes limited private samples for more accurate private data distribution estimation via a Top-*Q* voting mechanism, and leverages low-quality synthetic samples for contrastive generation through collaboration among dynamically weighted multiple pre-trained models. Extensive experiments on 6 well-developed datasets with 6 open-source and 3 closed-source PLMs demonstrate the superiority of WASP in improving model performance across diverse downstream tasks. Code is available at https://github.com/LindaLydia/WASP.
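For intuition, here is a minimal sketch of what a Top-*Q* voting step could look like. Everything in it is an assumption for illustration, not the paper's actual implementation: the function name `top_q_vote`, the use of cosine similarity over embeddings, and the value of `q` are all hypothetical, and the real mechanism would additionally perturb the vote histogram to satisfy DP, which is omitted here.

```python
import numpy as np

def top_q_vote(private_emb, synthetic_emb, q=5):
    """Hypothetical Top-Q voting: each private sample casts votes for its
    q most similar synthetic samples; vote counts act as quality scores.
    (A DP mechanism would add noise to these counts; omitted in this sketch.)"""
    # Cosine similarity between every private and synthetic embedding.
    p = private_emb / np.linalg.norm(private_emb, axis=1, keepdims=True)
    s = synthetic_emb / np.linalg.norm(synthetic_emb, axis=1, keepdims=True)
    sim = p @ s.T  # shape: (n_private, n_synthetic)
    votes = np.zeros(len(synthetic_emb))
    for row in sim:
        votes[np.argsort(row)[-q:]] += 1  # the q most similar samples get a vote
    return votes

# Usage: rank synthetic samples by votes; the lowest-voted ones can serve
# as negative examples for contrastive generation.
votes = top_q_vote(np.random.randn(20, 8), np.random.randn(100, 8), q=5)
good = np.argsort(votes)[-10:]  # highest-voted (positive) samples
bad = np.argsort(votes)[:10]    # lowest-voted (negative) samples
```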
Lay Summary: How can we create useful training data without risking anyone’s privacy? This was the question we set out to explore by studying how to generate synthetic data that mimics real private datasets with only a limited number of samples, while revealing as little as possible about the individuals behind them.
Our work introduces a method called **WASP**, which combines the strengths of multiple AI models to produce realistic and privacy-preserving synthetic data. Unlike existing approaches that rely on a single model or large amounts of real data, **WASP** uses a collaborative strategy: it asks different models to generate data, scores the results using limited private examples, and then learns to trust the best-performing models more in future rounds (see the sketch below). It also learns not only from good examples but from bad ones, by contrasting them during training.
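The round-by-round reweighting could be sketched as follows. This is our assumed form, not the paper's actual update rule: the function `update_model_weights`, the multiplicative exponential reward, and the learning rate `lr` are all illustrative.

```python
import numpy as np

def update_model_weights(weights, per_model_scores, lr=1.0):
    """Hypothetical dynamic weighting: models whose synthetic samples earned
    higher quality scores (e.g., Top-Q votes) gain influence next round.
    The exponential/softmax form here is an assumption for illustration."""
    scores = np.array([np.mean(s) for s in per_model_scores])
    new_w = weights * np.exp(lr * scores)  # multiplicatively reward better models
    return new_w / new_w.sum()             # renormalize to a distribution

# Usage over rounds: start with 3 pre-trained models trusted equally, then
# shift trust toward the model whose samples scored best this round.
weights = np.ones(3) / 3
scores = [np.array([0.8, 0.9]), np.array([0.2, 0.1]), np.array([0.5, 0.6])]
weights = update_model_weights(weights, scores)
print(weights)  # model 0 is trusted most in the next generation round
```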
We found that this strategy leads to better synthetic data even when only limited real examples are available. This is important, as many real-world applications, like healthcare and finance, must work with small amounts of sensitive data that cannot be freely shared. Our results suggest a promising path forward for training AI models in data-scarce, privacy-sensitive environments.
Link To Code: https://github.com/LindaLydia/WASP
Primary Area: Social Aspects->Privacy
Keywords: Differentially Private Synthetic Dataset, Collaboration between Private Data and Private Model, Fusion of Pre-trained Language Model
Submission Number: 830