InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training

Published: 23 Jan 2024, Last Modified: 23 May 2024 · TheWebConf 2024
Keywords: Web Infrastructure, Distributed Training, In-Network Aggregation, Route Selection, Programmable Switches
Abstract: Deep learning has brought about a revolutionary transformation in network applications, particularly in domains like e-commerce and online advertising. Distributed training (DT), as a critical means to expedite model training, has progressively emerged as a key foundational infrastructure for such applications. However, with the rapid advancement of hardware accelerators, the performance bottleneck in DT has shifted from computation to communication. In-network aggregation (INA) solutions have shown promise in alleviating the communication bottleneck. Regrettably, current INA solutions primarily focus on improving efficiency under the traditional parameter server (PS) architecture and do not fully address the communication bottleneck caused by limited PS ingress bandwidth. To bridge this gap, we propose InArt, the first work to introduce INA with route selection in a multi-PS architecture. InArt splits DT tasks among multiple PSs and selects appropriate routing schemes to fully harness INA capabilities. To accommodate traffic dynamics, InArt adopts a two-phase approach: splitting the training model among multiple parameter servers and selecting routing paths for INA. We propose Lagrange multiplier and randomized rounding algorithms for these two phases, respectively. We implement InArt and evaluate its performance through experiments on a physical platform (Tofino switches) and Mininet emulation (P4 software switches). Experimental results show that InArt reduces communication time by 49% compared with state-of-the-art solutions.
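To make the second phase concrete, the sketch below illustrates one common form of randomized rounding for route selection: given a fractional path assignment (e.g., from a relaxation of the routing problem), each flow is assigned a single path sampled in proportion to its fractional weights. This is a minimal, hypothetical illustration; the variable names (`fractional`, `flows`, candidate path labels) and the exact formulation are assumptions, not taken from the paper.

```python
# Hypothetical sketch of a randomized-rounding step for route selection.
# Assumes a fractional solution mapping each flow to per-path weights in [0, 1].
import random

def randomized_round(fractional: dict[str, dict[str, float]]) -> dict[str, str]:
    """Pick one candidate path per flow, sampling proportionally to the
    fractional solution values."""
    assignment = {}
    for flow, weights in fractional.items():
        paths = list(weights.keys())
        probs = [weights[p] for p in paths]
        total = sum(probs)
        # Normalize in case the fractional values do not sum exactly to 1.
        probs = [w / total for w in probs]
        assignment[flow] = random.choices(paths, weights=probs, k=1)[0]
    return assignment

# Illustrative example: two flows, each with two candidate paths through
# different aggregation switches (names are made up for this sketch).
fractional_solution = {
    "worker1->ps1": {"path_via_sw1": 0.7, "path_via_sw2": 0.3},
    "worker2->ps2": {"path_via_sw1": 0.4, "path_via_sw3": 0.6},
}
print(randomized_round(fractional_solution))
```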
Track: Systems and Infrastructure for Web, Mobile, and WoT
Submission Guidelines Scope: Yes
Submission Guidelines Blind: Yes
Submission Guidelines Format: Yes
Submission Guidelines Limit: Yes
Submission Guidelines Authorship: Yes
Student Author: Yes
Submission Number: 463