TL;DR: We propose a new gradient-matching approach to synthetic data generation that comes with theoretical guarantees.
Abstract: Synthetic data has the potential to improve performance and training efficiency while protecting the privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic: they cannot generate human-readable text without compromising the privacy of real data, nor can they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data and maps them to sequences of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves the privacy of the real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at [https://github.com/BigML-CS-UCLA/GRADMM](https://github.com/BigML-CS-UCLA/GRADMM).
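To make the gradient-matching idea concrete, the sketch below is a minimal, self-contained illustration and not the authors' GRADMM implementation: it uses plain gradient descent on a cosine gradient-matching objective instead of the full ADMM procedure, a toy linear classifier in place of an LLM, Gaussian noise as a stand-in for a privacy-noised target gradient, and nearest-token projection without the paper's perplexity constraint. All names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Sketch: optimize continuous synthetic embeddings so their induced gradient
# matches a (noisy) gradient from real data, then project onto token embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim, seq_len, num_classes = 100, 16, 8, 2

token_table = torch.randn(vocab_size, emb_dim)               # frozen "token embeddings"
W = torch.randn(emb_dim, num_classes, requires_grad=True)    # toy classifier weights

def loss_fn(emb_batch, labels):
    # mean-pool sequence embeddings, then apply the linear classifier
    logits = emb_batch.mean(dim=1) @ W
    return F.cross_entropy(logits, labels)

# "Real" data and its noisy target gradient (noise mimics a privacy mechanism).
real_tokens = torch.randint(0, vocab_size, (32, seq_len))
real_labels = torch.randint(0, num_classes, (32,))
real_emb = token_table[real_tokens]
target_grad = torch.autograd.grad(loss_fn(real_emb, real_labels), W)[0]
target_grad = target_grad + 0.01 * torch.randn_like(target_grad)

# Optimize a handful of synthetic embeddings to match that gradient.
syn_emb = torch.randn(4, seq_len, emb_dim, requires_grad=True)
syn_labels = torch.randint(0, num_classes, (4,))
opt = torch.optim.Adam([syn_emb], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    syn_grad = torch.autograd.grad(
        loss_fn(syn_emb, syn_labels), W, create_graph=True
    )[0]
    match_loss = 1 - F.cosine_similarity(
        syn_grad.flatten(), target_grad.flatten(), dim=0
    )
    match_loss.backward()
    opt.step()

# Map each optimized embedding to its nearest token (GRADMM additionally
# enforces low perplexity at this step; omitted here for brevity).
dists = torch.cdist(syn_emb.detach().reshape(-1, emb_dim), token_table)
syn_tokens = dists.argmin(dim=1).reshape(4, seq_len)
print("synthetic token ids:\n", syn_tokens)
```

In this simplified setting, only the target gradient (not the real examples themselves) drives the synthetic data, which is the intuition behind the convergence and privacy guarantees described above.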
Lay Summary: Large Language Models (LLMs) require massive amounts of high-quality data, but collecting and using such data raises concerns about privacy, cost, and efficiency. Our work introduces GRADMM, the first method that can generate readable synthetic text with strong theoretical guarantees for training LLMs. Unlike existing methods that rely on expensive prompts or unreadable embeddings, GRADMM creates human-like text that mimics how real data trains the model, without leaking sensitive information.
We achieve this by matching the training dynamics (gradients) of real data using a technique called ADMM, while ensuring the output is coherent and diverse. This allows us to train LLMs using only a handful of real examples, or to replace real data entirely with synthetic examples. Our experiments show that GRADMM outperforms both traditional data selection and LLM-generated text in accuracy, while being significantly more private and efficient. This opens the door to safer and more accessible LLM training.
Link To Code: https://github.com/BigML-CS-UCLA/GRADMM
Primary Area: Deep Learning->Large Language Models
Keywords: Synthetic data, Large language models, Gradient matching, ADMM
Submission Number: 13263