PrE-Text: Training Language Models on Private Federated Data in the Age of LLMs

Published: 05 Mar 2024, Last Modified: 04 May 2024
License: CC BY 4.0
Keywords: Differential privacy, large language models, synthetic data, federated learning
TL;DR: We generate DP synthetic data for finetuning models when data is private and federated.
Abstract: On-device training is the most common way to use private user data to train machine learning (ML) models. This approach has major drawbacks: (1) user devices are too small to train large models on-device, (2) it is communication- and computation-intensive for users, and (3) it can be hard to deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, training small models (models that fit on user devices) with PrE-Text synthetic data outperforms small models trained on-device under the high privacy regime ($\epsilon = 1.29$). We achieve these results while using 7x less total client computation and 40x less communication than on-device training. Altogether, these results suggest that, in the high-privacy regime, training on DP synthetic data may be a better option than training models on-device on private distributed data.
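The abstract describes generating DP synthetic text and then training models on it instead of on-device. Below is a minimal, self-contained sketch of a Private-Evolution-style loop of this kind: private examples privately vote for their nearest synthetic candidates, and the best-voted candidates are kept and expanded via variations. The toy embedding, the noise calibration, and the `vary` placeholder are illustrative assumptions, not the paper's actual implementation or privacy accounting.

```python
# Sketch of a Private-Evolution-style loop for DP synthetic text (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

def embed(texts):
    # Stand-in for a real text embedding model: hash tokens into a fixed-size vector.
    vecs = np.zeros((len(texts), 32))
    for i, t in enumerate(texts):
        for tok in t.split():
            vecs[i, hash(tok) % 32] += 1.0
    return vecs / (np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-8)

def dp_nearest_neighbor_histogram(private_texts, synthetic_texts, noise_std):
    # Each private example votes for its nearest synthetic candidate; Gaussian
    # noise on the released histogram provides differential privacy
    # (sensitivity 1 per example under add/remove adjacency).
    priv, syn = embed(private_texts), embed(synthetic_texts)
    votes = np.zeros(len(synthetic_texts))
    for j in np.argmax(priv @ syn.T, axis=1):
        votes[j] += 1.0
    return votes + rng.normal(0.0, noise_std, size=votes.shape)

def vary(text):
    # Placeholder for an LLM "variation" call (e.g., paraphrasing via an API).
    return text + " (varied)"

def pe_round(private_texts, synthetic_texts, noise_std=1.0, k=4):
    hist = dp_nearest_neighbor_histogram(private_texts, synthetic_texts, noise_std)
    survivors = [synthetic_texts[j] for j in np.argsort(hist)[-k:]]  # best-voted
    return survivors + [vary(t) for t in survivors]                  # expand pool

private = ["the cat sat on the mat", "dogs love long walks"]
candidates = ["a cat on a mat", "stock prices rose", "walking the dog", "rainy weather today"]
for _ in range(3):
    candidates = pe_round(private, candidates)
print(candidates[:4])
```

The resulting synthetic pool would then be used to finetune a small model offline, which is what replaces the on-device training rounds in the comparison the abstract describes.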
Submission Number: 13