SpareTrain: Fault-Tolerant LLM Training via Low-Cost Dual Modular Redundancy

Published: 26 Jan 2026 · Last Modified: 01 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: large language models, distributed training, silent data corruption, fault-tolerance, activation checkpointing, parallelism
TL;DR: SpareTrain is a novel framework that achieves full DMR protection for LLM training with negligible overhead, by repurposing activation checkpointing and exploiting idle GPU time to preserve throughput.
Abstract: Dual Modular Redundancy (DMR) is a highly effective mechanism for detecting silent data corruption (SDC), a critical reliability concern in large language model (LLM) training, by executing each operation twice. However, its high computational overhead has prevented practical deployment at scale. In this paper, we present SpareTrain, an LLM training system that achieves complete DMR with minimal overhead by repurposing the activation checkpointing mechanism and exploiting idle GPU time. Evaluations on up to 32 H200 GPUs show that SpareTrain improves throughput by 12–35% over naive DMR, corresponding to only 3–14% overhead compared to unprotected training, while maintaining full DMR error detection capabilities.
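To make the DMR idea in the abstract concrete, here is a minimal, hypothetical sketch of the core mechanism: execute an operation twice and compare the two results, flagging any disagreement as a suspected SDC. The helper name `dmr_execute` and its interface are illustrative assumptions, not SpareTrain's actual API, and the sketch omits the scheduling tricks (checkpoint reuse, idle-GPU placement) that the paper uses to hide the redundant execution's cost.

```python
def dmr_execute(op, *args, tol=0.0):
    """Illustrative DMR check: run `op` twice and compare the results.

    A mismatch beyond `tol` indicates a silent data corruption (SDC)
    in one of the two executions. `tol=0.0` demands bitwise-identical
    results, appropriate for deterministic kernels. Hypothetical
    helper, not part of SpareTrain.
    """
    primary = op(*args)
    shadow = op(*args)  # redundant execution: the source of DMR's 2x cost
    if abs(primary - shadow) > tol:
        raise RuntimeError("SDC detected: redundant executions disagree")
    return primary

# A healthy operation passes the redundancy check unchanged.
result = dmr_execute(lambda x, y: x * y, 3.0, 4.0)
# result == 12.0
```

SpareTrain's contribution is making this double execution nearly free: the second run is scheduled into GPU time that activation checkpointing and pipeline bubbles would otherwise leave idle, rather than doubling the critical-path compute as the naive version above would.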
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 23985