Keywords: distributed deep learning, pipeline parallelism, large-scale training, distributed systems
Abstract: Pipeline parallelism is a cornerstone of large-scale model training, yet its efficiency is fundamentally limited by straggler-induced pipeline bubbles. This issue is exacerbated by static scheduling approaches, including handcrafted heuristics and Integer Linear Programming (ILP), which are inherently brittle to real-world execution time variance. In this work, we introduce \textsc{Conductor}, a dynamic, two-tiered scheduling framework that, to our knowledge, is the first to virtually eliminate straggler-induced bubbles under realistic, stochastic conditions. The key insight is to decouple global, long-horizon scheduling from local, instantaneous load balancing. At a \textbf{coarse grain}, a reinforcement learning (RL) agent leverages millisecond-scale inference to generate robust global schedules, adapting to runtime dynamics in scenarios where traditional static solvers are intractable. At a \textbf{fine grain}, we introduce a dynamic computation migration mechanism that resolves residual micro-bubbles by offloading sub-computations, such as attention heads, from transiently slower to faster devices within a single timestep. Evaluated on large-scale LLM training configurations, our framework outperforms state-of-the-art static scheduling baselines by 5\%--14\% in throughput and demonstrates superior resilience to injected system noise and execution variance. We believe our results establish a new paradigm for adaptive pipeline scheduling, moving beyond static plans toward zero-straggler performance in practical, large-scale training environments.
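To make the fine-grained migration idea concrete, the following is a minimal, illustrative Python sketch of per-timestep attention-head rebalancing: heads are reassigned in inverse proportion to each device's most recently observed per-head latency. The function name rebalance_heads and the proportional-allocation heuristic are our assumptions for illustration, not the paper's actual mechanism.

# Hypothetical sketch: rebalance attention heads across devices using the
# latencies observed in the previous timestep. Not the paper's implementation.

def rebalance_heads(heads_per_device, latency_per_device):
    """Reassign attention heads so each device's projected time
    (per-head cost * head count) is approximately equal."""
    # Estimate per-head cost on each device from its last observed latency.
    per_head_cost = [lat / max(h, 1)
                     for h, lat in zip(heads_per_device, latency_per_device)]
    total_heads = sum(heads_per_device)
    # Ideal allocation: head counts inversely proportional to per-head cost.
    speed = [1.0 / c for c in per_head_cost]
    total_speed = sum(speed)
    target = [s / total_speed * total_heads for s in speed]
    # Round down, then hand leftover heads to the devices with the
    # largest fractional remainders, preserving the total head count.
    alloc = [int(t) for t in target]
    for _ in range(total_heads - sum(alloc)):
        i = max(range(len(target)), key=lambda j: target[j] - alloc[j])
        alloc[i] += 1
    return alloc

if __name__ == "__main__":
    # Two devices with 16 heads each; device 0 ran 20% slower last step,
    # so two heads migrate from the slower device to the faster one.
    print(rebalance_heads([16, 16], [12.0, 10.0]))  # -> [15, 17]

In a real pipeline, such a decision would have to run well inside one timestep and account for the cost of moving activations between devices; the sketch above ignores migration overhead for simplicity.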
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 8272