Sparing Strategies to Minimize Reliability Impact On Large Training Jobs

Kevin Quirk; Matthew Lennie; Ehsan K. Ardestani; Satyajeet Singh Ahuja; Matthew Bergeron; Andrew Grier; Zhaodong Wang; Mustafa Ozdal; Xu Zhang; Abhinav Triguna; Ying Zhang; Mathew Oldham; Chunqiang Tang

Published: 19 Mar 2026, Last Modified: 20 May 2026. Venue: MLSys 2026. License: CC BY 4.0

Keywords: large scale training, sparing strategy, modeling

TL;DR: A framework based on closed-loop modeling to optimize sparing needs for large-scale training jobs, accounting for the performance implications of failures.

Abstract: Training large language models (LLMs) on Meta’s AI clusters requires running long, distributed jobs that are vulnerable to hardware failures. To maintain high availability and efficiency, production systems use sparing, i.e., pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing strategy (including compute block size, number of spare blocks, and spare GPU trays) is complex and directly impacts cluster performance. We present an analytical framework with closed-form expressions to guide sparing strategy decisions, making practical, first-order recommendations for production environments. We also develop a simulation component to cross-validate the analytical model. Applied in Meta’s hyperscale infrastructure, this model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training. Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to Meta’s AI operations.

Supplementary Material: pdf

Topics: Reliability & Security: Fault-tolerant ML systems

Submission Number: 54
