Synthesizing Privacy-Preserving Text Data via Finetuning *without* Finetuning Billion-Scale LLMs

Bowen Tan; Zheng Xu; Eric P. Xing; Zhiting Hu; Shanshan Wu

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

TL;DR: We propose a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning.

Abstract: Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective but impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and make ineffective use of private information in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with **C**on**T**rollability and **CL**ustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
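
As a rough illustration of the last two steps in the abstract (extracting a DP histogram over topics, then sampling according to it), the sketch below adds Gaussian noise to per-cluster counts and uses the normalized result to decide how many synthetic examples to request per cluster. This is not the authors' implementation: the function names, the choice of Gaussian noise, the clipping step, and the toy cluster ids are all assumptions for illustration, and `noise_std` would have to be calibrated to the target privacy budget, which is not shown here.

```python
import random
from collections import Counter


def dp_histogram(cluster_ids, num_clusters, noise_std, rng):
    """Noisy, normalized histogram over cluster ids.

    Assumption: Gaussian noise is added to each count; the noise scale
    must be calibrated separately to the desired privacy guarantee.
    """
    counts = Counter(cluster_ids)
    noisy = [counts.get(k, 0) + rng.gauss(0.0, noise_std)
             for k in range(num_clusters)]
    # Post-processing: clip negative noisy counts to zero, then normalize.
    clipped = [max(v, 0.0) for v in noisy]
    total = sum(clipped)
    if total == 0.0:
        # Fall back to a uniform histogram if all mass was clipped away.
        return [1.0 / num_clusters] * num_clusters
    return [v / total for v in clipped]


def allocate_samples(hist, num_samples, rng):
    """Draw how many synthetic examples to request from each cluster."""
    draws = rng.choices(range(len(hist)), weights=hist, k=num_samples)
    counts = Counter(draws)
    return [counts.get(k, 0) for k in range(len(hist))]


rng = random.Random(0)
# Hypothetical cluster assignments of private examples (toy data).
private_cluster_ids = [0, 0, 1, 2, 2, 2, 2, 3]
hist = dp_histogram(private_cluster_ids, num_clusters=4, noise_std=1.0, rng=rng)
plan = allocate_samples(hist, num_samples=100, rng=rng)
```

In this sketch, `plan[k]` is the number of examples a conditional generator would be asked to produce for cluster `k`; because the allocation only post-processes the noisy histogram, the number of synthetic examples can be chosen freely without touching the private data again.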

Lay Summary: Synthetic data presents a viable solution for training models while safeguarding privacy. We propose a novel framework for generating privacy-preserving synthetic data that avoids the limitations of extensive prompt engineering and billion-scale model finetuning. Our method can be used in the resource-constrained setting to synthesize data for domains that require strong provable privacy guarantees.

Primary Area: Applications->Language, Speech and Dialog

Keywords: synthetic data, language model, differential privacy

Submission Number: 13871
