Maximizing Aggregation Throughput for Distributed Training with Constrained In-Network Computing

Long Luo, Shulin Yang, Hao Wu, Hongfang Yu, Bo Lei, Shuai Gao

Published: 2023, Last Modified: 12 May 2025, ICC 2023, CC BY-SA 4.0

Abstract: Distributed training (DT) has become an important and popular practice for collaboratively training high-quality machine learning (ML) models. The communication efficiency of gradient aggregation has been shown to be the primary performance bottleneck in distributed training today. Advanced programmable switches with in-network computing capabilities offer a promising way to improve the communication efficiency of DT by offloading part of the gradient aggregation from hosts to switches in the network. In this paper, we propose SPAR to optimize the performance of gradient aggregation under constrained in-network computing capabilities. To improve aggregation throughput, SPAR jointly optimizes the deployment of in-network aggregation switches and the routing of aggregation requests from workers. We formulate this joint optimization as an integer nonlinear programming problem and design an efficient greedy algorithm to compute solutions quickly. Experimental results show that SPAR significantly outperforms other state-of-the-art solutions based on in-network aggregation, improving aggregation throughput by up to 3×.
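The abstract does not spell out SPAR's formulation or its greedy procedure, so the following is only an illustrative sketch of the general idea of greedily placing a limited number of aggregation switches. It is not the authors' algorithm. The toy model is an assumption made here: each worker has a fixed route of switches to a parameter server, at most `budget` switches may aggregate, each aggregates at most `capacity` worker flows, and the number of flows reaching the server is used as a rough proxy for throughput. The function name `greedy_placement` and all parameters are hypothetical.

```python
def greedy_placement(paths, budget, capacity):
    """Toy greedy placement of aggregation switches (illustrative only).

    paths:    dict mapping worker -> list of switches on its route to the server
    budget:   maximum number of switches that may perform aggregation (assumed)
    capacity: maximum number of worker flows one switch can aggregate (assumed)
    Returns (placement, flows_at_server).
    """
    unassigned = set(paths)
    placement = {}  # switch -> workers whose gradients it aggregates

    for _ in range(budget):
        best_switch, best_workers = None, []
        # Candidate switches: on some unassigned worker's route, not yet used.
        candidates = {s for w in unassigned for s in paths[w]} - set(placement)
        for s in sorted(candidates):
            covered = sorted(w for w in unassigned if s in paths[w])[:capacity]
            if len(covered) > len(best_workers):
                best_switch, best_workers = s, covered
        # Aggregating fewer than two flows does not reduce traffic.
        if len(best_workers) < 2:
            break
        placement[best_switch] = best_workers
        unassigned -= set(best_workers)

    # Each aggregating switch emits one flow; unassigned workers send their own.
    flows_at_server = len(unassigned) + len(placement)
    return placement, flows_at_server


if __name__ == "__main__":
    routes = {
        "w1": ["s1", "s3"],
        "w2": ["s1", "s3"],
        "w3": ["s2", "s3"],
        "w4": ["s2", "s3"],
    }
    print(greedy_placement(routes, budget=1, capacity=4))
    print(greedy_placement(routes, budget=2, capacity=2))
```

In this toy setting, a single high-capacity switch near the server can absorb all four flows, while low-capacity switches force the placement toward the edge. A joint formulation such as the one the abstract describes would also choose the routes themselves, which this sketch keeps fixed.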