SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy

Published: 26 Jan 2026 · Last Modified: 01 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: large language models, distributed training, silent data corruption, fault-tolerance, activation checkpointing, parallelism
TL;DR: SpareTrain is a novel framework that achieves full DMR protection for LLM training with negligible overhead, by repurposing activation checkpointing and exploiting idle GPU time to preserve throughput.
Abstract: Dual Modular Redundancy (DMR) is a highly effective mechanism for detecting silent data corruption (SDC), a critical reliability concern in large language model (LLM) training, by executing each operation twice. However, its high computational overhead has prevented practical deployment at scale. In this paper, we present SpareTrain, an LLM training system that achieves complete DMR with minimal overhead by repurposing the activation checkpointing mechanism and exploiting idle GPU time. Evaluations on up to 32 H200 GPUs show that SpareTrain improves throughput by 12–35% over naive DMR, corresponding to only 3–14% overhead compared to unprotected training, while maintaining full DMR error detection capabilities.
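To make the DMR idea in the abstract concrete, here is a minimal, hypothetical sketch of the core mechanism: execute an operation twice and compare the two results, flagging any disagreement as a suspected SDC. The helper name `dmr_execute` and its interface are illustrative assumptions, not SpareTrain's actual API, and the sketch omits the scheduling tricks (checkpoint reuse, idle-GPU placement) that the paper uses to hide the redundant execution's cost.

```python
def dmr_execute(op, *args, tol=0.0):
    """Illustrative DMR check: run `op` twice and compare the results.

    A mismatch beyond `tol` indicates a silent data corruption (SDC)
    in one of the two executions. `tol=0.0` demands bitwise-identical
    results, appropriate for deterministic kernels. Hypothetical
    helper, not part of SpareTrain.
    """
    primary = op(*args)
    shadow = op(*args)  # redundant execution: the source of DMR's 2x cost
    if abs(primary - shadow) > tol:
        raise RuntimeError("SDC detected: redundant executions disagree")
    return primary

# A healthy operation passes the redundancy check unchanged.
result = dmr_execute(lambda x, y: x * y, 3.0, 4.0)
# result == 12.0
```

SpareTrain's contribution is making this double execution nearly free: the second run is scheduled into GPU time that activation checkpointing and pipeline bubbles would otherwise leave idle, rather than doubling the critical-path compute as the naive version above would.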
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 23985