TL;DR: An efficient method for scaling differentially private training to long-context LLMs
Abstract: Large language models (LLMs) are trained on vast datasets that may contain sensitive information. Differential privacy (DP), the de facto standard for formal privacy guarantees, provides a principled framework for training LLMs with provable privacy protection.
However, state-of-the-art DP training implementations rely on *fast gradient clipping* techniques with memory overhead $O(B\min(T^2, d^2))$, where $B$ is the batch size, $T$ is the sequence length, and $d$ is the layer width. This becomes prohibitive as both model width and context length grow. We propose DP-SGD-RC, a novel variant of DP-SGD with *randomized clipping* that reduces memory and compute overhead. DP-SGD-RC leverages *stochastic trace estimation* methods, specifically *Hutchinson's estimator* and its improved variant, Hutch$^{++}$, to reduce the memory footprint of per-sample gradient norm estimation.
We provide a tight privacy analysis showing that DP-SGD-RC achieves noise multipliers competitive with deterministic clipping.
Experiments fine-tuning Llama 3.2 1B on long-context benchmarks spanning classification, question answering, and summarization tasks demonstrate that DP-SGD-RC matches baseline utility while significantly reducing memory and compute.
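The core idea behind Hutchinson-style norm estimation can be illustrated for a single linear layer. The following is a minimal sketch under stated assumptions, not the paper's DP-SGD-RC algorithm: it assumes the per-sample gradient of the layer is $A^\top G$, where `A` (shape $T \times d_{in}$) holds the input activations and `G` (shape $T \times d_{out}$) holds the output gradients, and it estimates $\|A^\top G\|_F^2$ with Rademacher probes. The function names and shapes are hypothetical and chosen for illustration only.

```python
import numpy as np

def exact_sq_norm(A, G):
    """Exact squared Frobenius norm of the per-sample gradient A^T G.

    This materializes the full d_in x d_out gradient matrix.
    """
    return float(np.sum((A.T @ G) ** 2))

def hutchinson_sq_norm(A, G, num_probes, rng):
    """Hutchinson-style estimate of ||A^T G||_F^2.

    Uses the identity E[||M z||^2] = ||M||_F^2 for probes z with
    E[z z^T] = I, applied to M = A^T G.
    """
    d_out = G.shape[1]
    # Rademacher probes: entries are +1 or -1 with equal probability.
    Z = rng.choice([-1.0, 1.0], size=(d_out, num_probes))
    # Multiply right-to-left so the d_in x d_out gradient is never formed:
    # G @ Z has shape (T, num_probes), then A.T @ (...) has shape (d_in, num_probes).
    P = A.T @ (G @ Z)
    # Average the squared norms of the projected columns.
    return float(np.sum(P ** 2) / num_probes)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 8))   # T = 16, d_in = 8 (illustrative sizes)
    G = rng.standard_normal((16, 4))   # T = 16, d_out = 4
    print("exact:   ", exact_sq_norm(A, G))
    print("estimate:", hutchinson_sq_norm(A, G, num_probes=4000, rng=rng))
```

With `num_probes` much smaller than $T$ and $d$, the intermediate `G @ Z` and `P` are far smaller than the $T \times T$ or $d \times d$ matrices that exact fast-clipping approaches form, at the price of an estimate that is unbiased but noisy. How that randomness is accounted for in the privacy guarantee is the subject of the paper's analysis and is not captured by this sketch.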
Lay Summary: DP-SGD is a widely used algorithm for differentially private optimization of LLMs. However, its per-sample gradient clipping introduces major memory and compute overhead. Recent work has attempted to reduce the memory cost of privatizing gradients using techniques such as Ghost Clipping and Book Keeping, but for long-context models, existing techniques and variants of SGD scale poorly with sequence length. We propose a technique that performs DP-SGD for long-context models in a memory-efficient manner using randomized clipping, together with a new analysis establishing the privacy of our algorithm. Our experiments demonstrate that the proposed algorithm matches baseline utility while significantly reducing memory and compute.
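For readers unfamiliar with where the overhead comes from, the standard DP-SGD update can be sketched as follows. This is a minimal illustration of the textbook baseline with deterministic clipping, not the proposed randomized variant; the function name, argument names, and the flattened `(B, num_params)` gradient layout are assumptions made for the example.

```python
import numpy as np

def dp_sgd_noisy_gradient(per_sample_grads, clip_norm, noise_multiplier, rng):
    """One privatized gradient for standard DP-SGD.

    per_sample_grads: array of shape (B, num_params), one flattened
    gradient per example in the batch.
    """
    batch_size, num_params = per_sample_grads.shape
    # Per-sample L2 norms: this is the quantity that is expensive to
    # obtain when gradients are large or sequences are long.
    norms = np.linalg.norm(per_sample_grads, axis=1)
    # Scale each gradient so its norm is at most clip_norm.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_sample_grads * scale[:, None]
    # Gaussian noise calibrated to the clipping threshold.
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=num_params)
    # Sum the clipped gradients, add noise, and average over the batch.
    return (clipped.sum(axis=0) + noise) / batch_size
```

Storing `per_sample_grads` explicitly costs memory proportional to batch size times parameter count, which is why practical implementations compute only the per-sample norms, and why the cost of that norm computation matters at long sequence lengths.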
Primary Area: Social Aspects->Privacy
Keywords: differential privacy, clipping, dp-sgd, llm, memory efficiency, compute efficiency, random projections
Submission Number: 18622