TOPO-X: Co-optimize Flow Scheduling, Topology, and ML Training Parallelism

Published: 2025 · Last Modified: 18 Sept 2025 · ICCCN 2025 · CC BY-SA 4.0
Abstract: The rapid advancement of large-scale deep neural networks and large language models has intensified the demand for highly efficient GPU clusters. However, existing approaches, such as static Fat-tree topologies and reconfigurable designs like TopoOpt, suffer from inefficient resource utilization and network bottlenecks: they optimize communication, parallelism, and network topology independently, failing to exploit their interdependencies. To address this gap, we propose TOPO-X, a novel reconfigurable network framework that co-optimizes flow scheduling, training parallelism, and optical network topology. By formulating this integrated optimization as a Resource-Constrained Project Scheduling Problem (RCPSP), TOPO-X dynamically adapts to changing workloads and network conditions through optical network reconfiguration. Our experimental results show that TOPO-X outperforms the state-of-the-art solution, TopoOpt, achieving a 2.22× average speedup in training iteration time. These findings highlight TOPO-X as a promising approach for building scalable, adaptive, and high-performance GPU clusters that meet the growing demands of large-scale AI training workloads.
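To make the RCPSP framing concrete, here is a minimal, illustrative sketch (not the paper's actual formulation or solver) of how an iteration's communication flows, a topology reconfiguration, and a compute step can be modeled as precedence-constrained tasks competing for renewable resources such as optical ports and GPUs. All task names, durations, and capacities below are hypothetical; the scheduler is a standard greedy serial schedule-generation scheme, which places each task at its earliest precedence- and resource-feasible start.

```python
# Hedged sketch of an RCPSP model for one training iteration. Task names,
# durations, demands, and capacities are invented for illustration and are
# not taken from the TOPO-X paper.
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    duration: int                                # time slots the task occupies
    demand: dict                                 # resource -> units used while running
    preds: list = field(default_factory=list)    # names of predecessor tasks

def serial_sgs(tasks, capacity, horizon=1000):
    """Greedy serial schedule: earliest feasible start that respects both
    precedence constraints and renewable-resource capacities."""
    usage = {r: [0] * horizon for r in capacity}  # per-slot resource usage
    start, finish = {}, {}
    pending = list(tasks)
    while pending:
        # Pick any task whose predecessors have all finished (topological order).
        t = next(x for x in pending if all(p in finish for p in x.preds))
        pending.remove(t)
        est = max((finish[p] for p in t.preds), default=0)  # earliest start
        for s in range(est, horizon - t.duration):
            feasible = all(usage[r][slot] + need <= capacity[r]
                           for r, need in t.demand.items()
                           for slot in range(s, s + t.duration))
            if feasible:
                for r, need in t.demand.items():
                    for slot in range(s, s + t.duration):
                        usage[r][slot] += need
                start[t.name], finish[t.name] = s, s + t.duration
                break
    return start, finish

# Hypothetical iteration: two all-reduce flows share optical ports, a
# reconfiguration follows one of them, and a compute op waits on the reconfig.
tasks = [
    Task("allreduce_A", 3, {"ports": 2}),
    Task("allreduce_B", 3, {"ports": 2}),
    Task("reconfig",    1, {"ports": 4}, preds=["allreduce_A"]),
    Task("fwd_pass",    2, {"gpus": 4},  preds=["reconfig"]),
]
start, finish = serial_sgs(tasks, capacity={"ports": 4, "gpus": 8})
print(start)                              # slot at which each task begins
print("makespan:", max(finish.values()))  # iteration time under this schedule
```

In this toy instance the two all-reduce flows run concurrently because their combined port demand fits the capacity, while the reconfiguration and the dependent compute step serialize behind them; the co-optimization TOPO-X describes amounts to choosing the task set, demands, and precedence edges (i.e., the parallelism plan and topology changes) so that the resulting RCPSP makespan is minimized.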