Keywords: distributed deep learning, pipeline parallelism, large-scale training, distributed systems
Abstract: Pipeline parallelism is a cornerstone of large-scale model training, yet its efficiency is fundamentally limited by straggler-induced pipeline bubbles. This issue is exacerbated by static scheduling approaches, including handcrafted heuristics and Integer Linear Programming (ILP), which are inherently brittle to real-world execution time variance. In this work, we introduce \textsc{Conductor}, a dynamic, two-tiered scheduling framework that, to our knowledge, is the first to virtually eliminate straggler-induced bubbles under realistic, stochastic conditions. The key insight is to decouple global, long-horizon scheduling from local, instantaneous load balancing. At a \textbf{coarse grain}, a reinforcement learning (RL) agent leverages millisecond-scale inference to generate robust global schedules, adapting to runtime dynamics in scenarios where traditional static solvers are intractable. At a \textbf{fine grain}, we introduce a dynamic computation migration mechanism that resolves residual micro-bubbles by offloading sub-computations, such as attention heads, from transiently slower to faster devices within a single timestep. Evaluated on large-scale LLM training configurations, our framework outperforms state-of-the-art static scheduling baselines by 5\%--14\% in throughput and demonstrates superior resilience to injected system noise and execution variance. We believe our results establish a new paradigm for adaptive pipeline scheduling, moving beyond static plans toward zero-straggler performance in practical, large-scale training environments.
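To make the fine-grained migration idea concrete, the following is a minimal, illustrative Python sketch of per-timestep attention-head rebalancing: heads are reassigned in inverse proportion to each device's most recently observed per-head latency. The function name rebalance_heads and the proportional-allocation heuristic are our assumptions for illustration, not the paper's actual mechanism.

# Hypothetical sketch: rebalance attention heads across devices using the
# latencies observed in the previous timestep. Not the paper's implementation.

def rebalance_heads(heads_per_device, latency_per_device):
    """Reassign attention heads so each device's projected time
    (per-head cost * head count) is approximately equal."""
    # Estimate per-head cost on each device from its last observed latency.
    per_head_cost = [lat / max(h, 1)
                     for h, lat in zip(heads_per_device, latency_per_device)]
    total_heads = sum(heads_per_device)
    # Ideal allocation: head counts inversely proportional to per-head cost.
    speed = [1.0 / c for c in per_head_cost]
    total_speed = sum(speed)
    target = [s / total_speed * total_heads for s in speed]
    # Round down, then hand leftover heads to the devices with the
    # largest fractional remainders, preserving the total head count.
    alloc = [int(t) for t in target]
    for _ in range(total_heads - sum(alloc)):
        i = max(range(len(target)), key=lambda j: target[j] - alloc[j])
        alloc[i] += 1
    return alloc

if __name__ == "__main__":
    # Two devices with 16 heads each; device 0 ran 20% slower last step,
    # so two heads migrate from the slower device to the faster one.
    print(rebalance_heads([16, 16], [12.0, 10.0]))  # -> [15, 17]

In a real pipeline, such a decision would have to run well inside one timestep and account for the cost of moving activations between devices; the sketch above ignores migration overhead for simplicity.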
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 8272