Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

Published: 06 Mar 2025, Last Modified: 11 Mar 2025, ICLR 2025 Workshop Data Problems (Oral), CC BY 4.0
Keywords: synthetic data, differential privacy
TL;DR: We propose a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning.
Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective, but it is impractical when computational resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their filtering-based process. To overcome these limitations, we propose CTCL (Data Synthesis with Controllability and Clustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. To adapt to the private domain, the generator is DP-finetuned on private data to capture fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize the desired number of examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Further analysis validates the design of each framework component and highlights the scalability of our approach.
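
For intuition, the two private-domain steps described in the abstract (building a DP topic histogram and sampling synthetic examples in proportion to it) can be sketched as below. This is a minimal illustration, not the paper's implementation: the Laplace-mechanism choice, the function names (`dp_topic_histogram`, `synthesize`), and the `generate_fn` wrapper around a DP-finetuned conditional generator are all assumptions made for the example.

```python
import numpy as np

def dp_topic_histogram(private_topic_ids, num_topics, epsilon, rng=None):
    """Build a differentially private histogram over topic clusters.

    Assumes each private document contributes one count to its assigned
    topic, so per-document sensitivity is 1 and Laplace noise with scale
    1/epsilon gives an epsilon-DP estimate of the topic distribution.
    """
    rng = rng or np.random.default_rng()
    counts = np.bincount(private_topic_ids, minlength=num_topics).astype(float)
    noisy = counts + rng.laplace(scale=1.0 / epsilon, size=num_topics)
    noisy = np.clip(noisy, 0.0, None)   # drop negative noisy counts
    return noisy / noisy.sum()          # normalize to a probability vector

def synthesize(generate_fn, topic_keywords, dp_hist, num_samples, rng=None):
    """Sample topics in proportion to the DP histogram and generate one
    synthetic example per sampled topic, conditioning on its keywords.

    `generate_fn` is a hypothetical wrapper around the DP-finetuned
    conditional generator; `topic_keywords[t]` holds the keywords of topic t.
    """
    rng = rng or np.random.default_rng()
    topics = rng.choice(len(dp_hist), size=num_samples, p=dp_hist)
    return [generate_fn(topic_keywords[t]) for t in topics]
```

Under this reading, all private information reaches the synthetic data through only two DP mechanisms: the noisy topic histogram and the DP-finetuned generator itself, so no additional privacy cost is incurred at sampling time and an arbitrary number of examples can be drawn.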
Submission Number: 67