Keywords: Differential Privacy, LLM, Distributed training
Abstract: Large language models excel at in-context learning but can memorize sensitive sequences, enabling membership-inference and extraction attacks. Differential privacy (DP) offers provable protection, yet DP training remains costly at long contexts. Prior work largely targets short-sequence DP fine-tuning, and the strongest public DP pretraining scales only to 1B parameters at 1,024 tokens. Profiling state-of-the-art distributed DP training reveals two blockers: **efficiency** losses from a mixed ghost-norm clipping heuristic that wastes compute at longer sequences, and **scalability** limits where FSDP alone cannot break the single-GPU memory ceiling for sequence length. We introduce LongShield, a memory- and communication-efficient context-parallel DP training method that closes the performance gap to non-DP while enabling long-context scaling on modest GPU budgets. LongShield keeps per-sample gradient shards local to each GPU to avoid full materialization, overlaps per-sample gradient aggregation with backward computation to sustain throughput, and enables DP-safe activation checkpointing to further extend context. These system changes leave the underlying DP algorithm and accounting unchanged and use flat clipping for best convergence. On Llama-3.1-8B with 4×NVIDIA H100, LongShield scales sequence length from 4k to 16k, achieves linear sequence-length scaling, and shrinks the DP–non-DP throughput gap from 50% to 7% while matching non-DP memory usage. With activation checkpointing, LongShield reaches 32k context at the expected checkpointing overhead. These results show that long-context DP training is practical on modest GPU budgets.
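The flat clipping the abstract refers to is standard DP-SGD per-sample clipping with a single L2 norm taken over all of a sample's parameter gradients. The sketch below is a minimal, hypothetical PyTorch illustration of that baseline step (generic `clip_norm` and `noise_multiplier` hyperparameters are assumed); it is not LongShield's sharded, overlapped implementation.

```python
# Illustrative DP-SGD step with flat (global-norm) per-sample clipping.
# Assumption: a small model and batch that fit on one device; this is a sketch,
# not LongShield's context-parallel, shard-local implementation.
import torch

def dp_sgd_step(model, loss_fn, xs, ys, clip_norm=1.0, noise_multiplier=1.0, lr=1e-3):
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    # Compute each example's gradient separately, clip its *flat* L2 norm
    # (one norm over all parameters), then accumulate the clipped gradients.
    for x, y in zip(xs, ys):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        flat_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()
        scale = min(1.0, clip_norm / (flat_norm + 1e-12))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)

    # Add Gaussian noise calibrated to the clipping bound, then take an SGD step.
    batch_size = xs.shape[0]
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * (noise_multiplier * clip_norm)
            p.add_((s + noise) / batch_size, alpha=-lr)
```

In a naive implementation like this, the per-sample gradients (or their accumulation) dominate memory and serialize with the backward pass; the abstract's shard-local storage and overlap of aggregation with backward are what remove those costs at long context.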
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 22163