Soft Constraints, Strong Solutions: Optimizing Intra-Operator Parallelism for Distributed Deep Learning

19 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: distributed deep learning, intra-operator parallelism, competition, cost function network
TL;DR: We present a scalable relaxation-based solver for intra-operator parallelism that consistently produces low-cost solutions under tight time budgets and outperforms XLA by up to several orders of magnitude.
Abstract: As deep learning models grow in size and complexity, efficiently mapping their computations onto distributed hardware is a central challenge for systems and compiler design. A key technique for addressing this challenge is intra-operator parallelism, which partitions individual operations across multiple devices. To accelerate research on automated intra-operator parallelism, Google curated a benchmark suite of 25 large-scale instances drawn from real production workloads including Graph Network Simulators, U-Nets, diffusion models, and Gemma 1 and Gemma 2 language models, and organized the ASPLOS/EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning. The contest formalized intra-operator parallelism as a constrained combinatorial optimization problem in which each computational-graph node must be assigned an execution strategy that minimizes compute and communication cost while satisfying strict time-varying memory limits. This paper presents the winning solution. We show that relaxing the hard memory constraints enables the problem to be reformulated as a Cost Function Network optimization task. Building on this idea, we develop a solver that combines adaptive penalty-based relaxation with efficient Cost Function Network optimization. The method quickly produces feasible strategies with costs near the global optimum on nearly all benchmark instances, consistently outperforming XLA, the production-grade compiler used in TensorFlow and JAX, often by orders of magnitude.
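The penalty-based relaxation described in the abstract can be illustrated on a toy instance. The sketch below is not the authors' solver. The instance data, the linear overflow penalty, the doubling schedule for the penalty weight, and all function names are assumptions made for illustration, and exhaustive enumeration stands in for a real Cost Function Network solver. Each node picks one strategy; the hard per-time-step memory limit is replaced by a soft term that charges for every unit of overflow, and the penalty weight grows until the minimizer of the relaxed objective is feasible.

```python
from itertools import product

# Toy instance; every number here is invented for illustration.
# node_cost[i][s] / node_mem[i][s]: compute cost and memory use of strategy s at node i.
node_cost = [[10, 4], [8, 3], [6, 2]]
node_mem = [[1, 3], [1, 3], [1, 3]]
# edge_cost[(u, v)][su][sv]: communication cost when u uses su and v uses sv.
edge_cost = {(0, 1): [[0, 2], [2, 0]], (1, 2): [[0, 2], [2, 0]]}
# live[i] = (start, end): node i holds memory for time steps start..end-1.
live = [(0, 2), (1, 3), (2, 4)]
MEM_LIMIT = 5
HORIZON = 4


def cost(assign):
    """Compute plus communication cost of a full strategy assignment."""
    total = sum(node_cost[i][s] for i, s in enumerate(assign))
    total += sum(m[assign[u]][assign[v]] for (u, v), m in edge_cost.items())
    return total


def overflow(assign):
    """Total memory used above the limit, summed over all time steps."""
    total = 0
    for t in range(HORIZON):
        used = sum(node_mem[i][s] for i, s in enumerate(assign)
                   if live[i][0] <= t < live[i][1])
        total += max(0, used - MEM_LIMIT)
    return total


def solve_relaxed(lam):
    """Minimize cost + lam * overflow. Brute force stands in for a CFN solver."""
    candidates = product(range(2), repeat=len(node_cost))
    return min(candidates, key=lambda a: cost(a) + lam * overflow(a))


def adaptive_penalty(lam=1.0, growth=2.0, max_rounds=20):
    """Raise the penalty weight until the relaxed optimum respects the limit."""
    for _ in range(max_rounds):
        assign = solve_relaxed(lam)
        if overflow(assign) == 0:
            return assign, cost(assign), lam
        lam *= growth  # assumed schedule: multiply the weight each round
    return None  # no feasible assignment found within max_rounds


if __name__ == "__main__":
    print(adaptive_penalty())
```

On this toy instance the cheapest assignment ignoring memory exceeds the limit at two time steps, so the loop has to raise the weight several times before the relaxed minimizer becomes feasible. A weight that is too small leaves violations in place, while a large weight makes feasibility dominate the objective, which is one reason to adapt the weight rather than fix it in advance.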
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 20718