Soft Constraints, Strong Solutions: Optimizing Intra-Operator Parallelism for Distributed Deep Learning

ICLR 2026 Conference Submission 20718 Authors

19 Sept 2025 (modified: 08 Oct 2025); ICLR 2026 Conference Submission; CC BY 4.0
Keywords: distributed deep learning, intra-operator parallelism, competition, cost function network
TL;DR: We present a scalable relaxation-based solver for intra-operator parallelism that consistently produces low-cost solutions under tight time budgets and outperforms XLA by up to several orders of magnitude.
Abstract: As deep learning models continue to increase in size and complexity, mapping their computations efficiently onto distributed hardware has become a central challenge in systems and compiler design. A key technique for addressing this challenge is intra-operator parallelism, which involves partitioning individual operations across multiple devices. This enables large-scale models to make more effective use of available hardware while satisfying strict memory and communication constraints. The ASPLOS / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning formalized this challenge as a constrained combinatorial optimization problem, requiring a strategy assignment for each graph node that minimizes total cost while respecting time-varying memory limits. This paper presents the top-performing solution to the contest, based on usage-constrained relaxation, which incorporates memory usage directly into the cost model rather than enforcing it as a hard constraint. Together with adaptive weight tuning, the method guarantees valid assignments and scales to computation graphs with tens of thousands of nodes. The solver achieves state-of-the-art results across all contest benchmarks, consistently producing low-cost solutions within strict time limits. The paper details the core algorithmic components and discusses their broader applicability to compiler-level optimization in distributed deep learning.
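Illustrative sketch (not the authors' solver): the abstract describes folding memory usage into the cost model and tuning the penalty weights adaptively. The toy Python below shows one way that idea can look, under loud assumptions. The names `Node`, `usage`, `solve_relaxed`, and `soft_constraint_search` are hypothetical; edge (communication) costs are omitted; the inner coordinate descent and the multiplicative weight update are stand-ins chosen for brevity; and, unlike the method described above, this sketch does not guarantee a valid assignment and simply returns `None` if it fails to find one.

```python
from dataclasses import dataclass

@dataclass
class Node:
    costs: list[float]  # cost of each candidate strategy (hypothetical data model)
    mems: list[float]   # memory used by each candidate strategy
    live: range         # time steps during which the node occupies memory

def usage(nodes, choice, horizon):
    """Memory in use at each time step under the given strategy choice."""
    used = [0.0] * horizon
    for node, s in zip(nodes, choice):
        for t in node.live:
            used[t] += node.mems[s]
    return used

def solve_relaxed(nodes, weights, limit, choice, sweeps=20):
    """Coordinate descent on: cost + sum_t weights[t] * max(0, usage[t] - limit)."""
    used = usage(nodes, choice, len(weights))
    for _ in range(sweeps):
        changed = False
        for i, node in enumerate(nodes):
            def penalized(s):
                # Cost of strategy s plus weighted overflow on the steps node i touches.
                overflow = sum(
                    weights[t] * max(0.0, used[t] - node.mems[choice[i]] + node.mems[s] - limit)
                    for t in node.live
                )
                return node.costs[s] + overflow
            best = min(range(len(node.costs)), key=penalized)
            if penalized(best) < penalized(choice[i]):  # strict improvement only
                for t in node.live:
                    used[t] += node.mems[best] - node.mems[choice[i]]
                choice[i] = best
                changed = True
        if not changed:
            break
    return choice

def soft_constraint_search(nodes, limit, horizon, rounds=50, growth=2.0):
    """Raise the penalty weight of every violated time step until none remain."""
    weights = [1.0] * horizon
    # Start from the cheapest strategy per node, ignoring memory entirely.
    choice = [min(range(len(n.costs)), key=n.costs.__getitem__) for n in nodes]
    for _ in range(rounds):
        choice = solve_relaxed(nodes, weights, limit, choice)
        used = usage(nodes, choice, horizon)
        violated = [t for t in range(horizon) if used[t] > limit]
        if not violated:
            total = sum(n.costs[s] for n, s in zip(nodes, choice))
            return choice, total
        for t in violated:
            weights[t] *= growth  # adaptive tuning: only tight steps get more expensive
    return None  # this sketch gives no feasibility guarantee

# Toy instance: three nodes with overlapping lifetimes and a memory limit of 10.
nodes = [
    Node(costs=[1.0, 4.0], mems=[8.0, 3.0], live=range(0, 2)),
    Node(costs=[2.0, 3.0], mems=[6.0, 4.0], live=range(1, 3)),
    Node(costs=[1.0, 5.0], mems=[7.0, 2.0], live=range(2, 4)),
]
result = soft_constraint_search(nodes, limit=10.0, horizon=4)
```

Per-time-step weights mean that only the congested parts of the schedule become expensive, so nodes that are live elsewhere remain free to keep their cheapest strategy.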
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 20718