TL;DR: We propose a new gradient-matching approach to synthetic data generation that comes with theoretical guarantees.
Abstract: Synthetic data has the potential to improve performance and training efficiency while protecting the privacy of real training examples. Nevertheless, existing approaches for synthetic text generation are mostly heuristic: they cannot generate human-readable text without compromising the privacy of real data, nor can they provide performance guarantees for training Large Language Models (LLMs). In this work, we propose the first theoretically rigorous approach for generating synthetic human-readable text that provides convergence, performance, and privacy guarantees for fine-tuning LLMs on a target task. To do so, we leverage the Alternating Direction Method of Multipliers (ADMM), which iteratively optimizes the embeddings of synthetic examples to match the noisy gradient of the target training or validation data and maps them to sequences of text tokens with low perplexity. In doing so, the generated synthetic text guarantees convergence of the model to a close neighborhood of the solution obtained by fine-tuning on real data and preserves the privacy of the real data. Experiments on various classification tasks confirm the effectiveness of our proposed approach. Our code is available at [https://github.com/BigML-CS-UCLA/GRADMM](https://github.com/BigML-CS-UCLA/GRADMM).
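To make the gradient-matching idea concrete, the sketch below is a minimal, self-contained illustration and not the authors' GRADMM implementation: it uses plain gradient descent on a cosine gradient-matching objective instead of the full ADMM procedure, a toy linear classifier in place of an LLM, Gaussian noise as a stand-in for a privacy-noised target gradient, and nearest-token projection without the paper's perplexity constraint. All names, shapes, and hyperparameters here are illustrative assumptions.

```python
# Sketch: optimize continuous synthetic embeddings so their induced gradient
# matches a (noisy) gradient from real data, then project onto token embeddings.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, emb_dim, seq_len, num_classes = 100, 16, 8, 2

token_table = torch.randn(vocab_size, emb_dim)               # frozen "token embeddings"
W = torch.randn(emb_dim, num_classes, requires_grad=True)    # toy classifier weights

def loss_fn(emb_batch, labels):
    # mean-pool sequence embeddings, then apply the linear classifier
    logits = emb_batch.mean(dim=1) @ W
    return F.cross_entropy(logits, labels)

# "Real" data and its noisy target gradient (noise mimics a privacy mechanism).
real_tokens = torch.randint(0, vocab_size, (32, seq_len))
real_labels = torch.randint(0, num_classes, (32,))
real_emb = token_table[real_tokens]
target_grad = torch.autograd.grad(loss_fn(real_emb, real_labels), W)[0]
target_grad = target_grad + 0.01 * torch.randn_like(target_grad)

# Optimize a handful of synthetic embeddings to match that gradient.
syn_emb = torch.randn(4, seq_len, emb_dim, requires_grad=True)
syn_labels = torch.randint(0, num_classes, (4,))
opt = torch.optim.Adam([syn_emb], lr=0.05)

for _ in range(300):
    opt.zero_grad()
    syn_grad = torch.autograd.grad(
        loss_fn(syn_emb, syn_labels), W, create_graph=True
    )[0]
    match_loss = 1 - F.cosine_similarity(
        syn_grad.flatten(), target_grad.flatten(), dim=0
    )
    match_loss.backward()
    opt.step()

# Map each optimized embedding to its nearest token (GRADMM additionally
# enforces low perplexity at this step; omitted here for brevity).
dists = torch.cdist(syn_emb.detach().reshape(-1, emb_dim), token_table)
syn_tokens = dists.argmin(dim=1).reshape(4, seq_len)
print("synthetic token ids:\n", syn_tokens)
```

In this simplified setting, only the target gradient (not the real examples themselves) drives the synthetic data, which is the intuition behind the convergence and privacy guarantees described above.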
Lay Summary: Large Language Models (LLMs) require massive amounts of high-quality data, but collecting and using such data raises concerns about privacy, cost, and efficiency. Our work introduces GRADMM, the first method that can generate readable synthetic text with strong theoretical guarantees for training LLMs. Unlike existing methods that rely on expensive prompts or unreadable embeddings, GRADMM creates human-like text that mimics how real data trains the model, without leaking sensitive information.
We achieve this by matching the training dynamics (gradients) of real data using a technique called ADMM, while ensuring the output is coherent and diverse. This allows us to train LLMs using only a handful of real examples, or to replace real data entirely with synthetic examples. Our experiments show that GRADMM outperforms both traditional data selection and LLM-generated text in accuracy, while being significantly more private and efficient. This opens the door to safer and more accessible LLM training.
Link To Code: https://github.com/BigML-CS-UCLA/GRADMM
Primary Area: Deep Learning->Large Language Models
Keywords: Synthetic data, Large language models, Gradient matching, ADMM
Submission Number: 13263