Keywords: large-scale training, sparing strategy, modeling
TL;DR: A framework based on closed-loop modeling that optimizes sparing for large-scale training jobs, accounting for the performance impact of failures.
Abstract: Training large language models (LLMs) on Meta’s AI clusters requires running long, distributed jobs that are
vulnerable to hardware failures. To maintain high availability and efficiency, production systems use sparing, i.e.,
pre-allocating spare compute resources that can replace failed components. However, choosing the optimal sparing
strategy, including compute block size, number of spare blocks, and spare GPU trays, is complex and directly
impacts cluster performance. We present an analytical framework with closed-form expressions to guide sparing
strategy decisions, making practical, first-order recommendations for production environments. We also develop a
simulation component to cross-validate the analytical model. Applied in Meta’s hyperscale infrastructure, this
model helps engineers optimize fault tolerance, minimize downtime, and maximize goodput during LLM training.
Our real-world use case demonstrates how the framework informs robust, cost-effective design choices critical to
Meta’s AI operations.
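
The idea of pairing a closed-form expression with a simulation cross-check can be illustrated with a toy model. The sketch below is not the paper's model: it assumes each compute block fails independently with the same probability during a window, that spares themselves never fail, and that a job survives if the number of failed blocks does not exceed the number of spare blocks. The function names and parameters (`prob_spares_suffice`, `simulate_spares_suffice`, `p_fail`) are hypothetical and chosen only for illustration.

```python
import math
import random


def prob_spares_suffice(n_blocks: int, n_spares: int, p_fail: float) -> float:
    """Closed form under the toy assumptions: P(failed blocks <= n_spares),
    where the number of failed blocks is Binomial(n_blocks, p_fail)."""
    return sum(
        math.comb(n_blocks, k) * p_fail**k * (1.0 - p_fail) ** (n_blocks - k)
        for k in range(n_spares + 1)
    )


def simulate_spares_suffice(
    n_blocks: int,
    n_spares: int,
    p_fail: float,
    trials: int = 20000,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of the same probability, used as a cross-check."""
    rng = random.Random(seed)  # fixed seed keeps the estimate reproducible
    survived = 0
    for _ in range(trials):
        # Draw an independent failure indicator for every block.
        failures = sum(rng.random() < p_fail for _ in range(n_blocks))
        if failures <= n_spares:
            survived += 1
    return survived / trials


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the paper.
    analytic = prob_spares_suffice(n_blocks=64, n_spares=2, p_fail=0.02)
    simulated = simulate_spares_suffice(n_blocks=64, n_spares=2, p_fail=0.02)
    print(f"analytic={analytic:.4f} simulated={simulated:.4f}")
```

In this toy setting, sweeping `n_spares` shows how quickly the marginal benefit of an extra spare block diminishes, which is the kind of first-order trade-off a sparing model is meant to expose.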
Supplementary Material: pdf
Topics: Reliability & Security: Fault-tolerant ML systems
Submission Number: 54