InArt: In-Network Aggregation with Route Selection for Accelerating Distributed Training

Published: 23 Jan 2024, Last Modified: 23 May 2024 · TheWebConf 2024
Keywords: Web Infrastructure, Distributed Training, In-Network Aggregation, Route Selection, Programmable Switches
Abstract: Deep learning has brought about a revolutionary transformation in network applications, particularly in domains like e-commerce and online advertising. Distributed training (DT), as a critical means to expedite model training, has progressively emerged as a key foundational infrastructure for such applications. However, with the rapid advancement of hardware accelerators, the performance bottleneck in DT has shifted from computation to communication. In-network aggregation (INA) solutions have shown promise in alleviating the communication bottleneck. Regrettably, current INA solutions primarily focus on improving efficiency under the traditional parameter server (PS) architecture and do not fully address the communication bottleneck caused by limited PS ingress bandwidth. To bridge this gap, we propose InArt, the first work to introduce INA with route selection in a multi-PS architecture. InArt splits DT tasks among multiple PSs and selects appropriate routing schemes to fully harness INA capabilities. To accommodate traffic dynamics, InArt adopts a two-phase approach: splitting the training model among multiple parameter servers and selecting routing paths for INA. We propose Lagrange multiplier and randomized rounding algorithms for these two phases, respectively. We implement InArt and evaluate its performance through experiments on a physical platform (Tofino switches) and Mininet emulation (P4 software switches). Experimental results show that InArt reduces communication time by 49% compared with state-of-the-art solutions.
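To make the second phase concrete, the sketch below illustrates one common form of randomized rounding for route selection: given a fractional path assignment (e.g., from a relaxation of the routing problem), each flow is assigned a single path sampled in proportion to its fractional weights. This is a minimal, hypothetical illustration; the variable names (`fractional`, `flows`, candidate path labels) and the exact formulation are assumptions, not taken from the paper.

```python
# Hypothetical sketch of a randomized-rounding step for route selection.
# Assumes a fractional solution mapping each flow to per-path weights in [0, 1].
import random

def randomized_round(fractional: dict[str, dict[str, float]]) -> dict[str, str]:
    """Pick one candidate path per flow, sampling proportionally to the
    fractional solution values."""
    assignment = {}
    for flow, weights in fractional.items():
        paths = list(weights.keys())
        probs = [weights[p] for p in paths]
        total = sum(probs)
        # Normalize in case the fractional values do not sum exactly to 1.
        probs = [w / total for w in probs]
        assignment[flow] = random.choices(paths, weights=probs, k=1)[0]
    return assignment

# Illustrative example: two flows, each with two candidate paths through
# different aggregation switches (names are made up for this sketch).
fractional_solution = {
    "worker1->ps1": {"path_via_sw1": 0.7, "path_via_sw2": 0.3},
    "worker2->ps2": {"path_via_sw1": 0.4, "path_via_sw3": 0.6},
}
print(randomized_round(fractional_solution))
```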
Track: Systems and Infrastructure for Web, Mobile, and WoT
Submission Guidelines Scope: Yes
Submission Guidelines Blind: Yes
Submission Guidelines Format: Yes
Submission Guidelines Limit: Yes
Submission Guidelines Authorship: Yes
Student Author: Yes
Submission Number: 463