Keywords: Decentralized Training, Asynchronous Pipeline Parallelism
TL;DR: We analyze gradient staleness in an asynchronous variant of SWARM parallelism and propose a weight correction technique based on Nesterov Accelerated Gradient (NAG).
Abstract: SWARM parallelism is a framework that enhances pipeline parallelism in distributed training by incorporating fault tolerance. However, its reliance on synchronous updates introduces inefficiencies that limit throughput and scalability. We analyze these inefficiencies and propose an asynchronous modification to the framework in which nodes perform local updates and periodically average their states. Our results demonstrate that this asynchronous variant of SWARM achieves higher throughput without sacrificing model convergence.
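The abstract and TL;DR describe two ingredients: workers taking independent local update steps with periodic averaging of their states, and a NAG-style correction intended to compensate for gradient staleness. The toy simulation below is a minimal sketch of that idea under our own assumptions, not the paper's implementation: the function names (`train_async_swarm_sketch`, `nesterov_lookahead`), the quadratic toy objective, and all hyperparameters are illustrative placeholders.

```python
import numpy as np

def nesterov_lookahead(w, m, mu):
    # Evaluate the gradient at the lookahead point w + mu * m (NAG-style).
    # The lookahead approximates where the weights will be by the time a
    # delayed gradient is applied, acting as a simple staleness correction.
    return w + mu * m

def train_async_swarm_sketch(num_workers=4, local_steps=8, rounds=20,
                             lr=0.1, mu=0.9, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)                 # toy objective: ||w - target||^2
    w = [np.zeros(dim) for _ in range(num_workers)]
    m = [np.zeros(dim) for _ in range(num_workers)]

    for _ in range(rounds):
        # Asynchronous phase: each worker performs independent local updates.
        for k in range(num_workers):
            for _ in range(local_steps):
                lookahead = nesterov_lookahead(w[k], m[k], mu)
                grad = 2.0 * (lookahead - target)  # gradient of the toy loss at the lookahead point
                m[k] = mu * m[k] - lr * grad
                w[k] = w[k] + m[k]
        # Synchronization phase: periodically average worker states.
        avg = np.mean(w, axis=0)
        w = [avg.copy() for _ in range(num_workers)]

    loss = float(np.mean((np.mean(w, axis=0) - target) ** 2))
    print(f"final loss: {loss:.6f}")

if __name__ == "__main__":
    train_async_swarm_sketch()
```

Running the sketch shows the averaged state converging on the toy objective despite workers drifting apart between averaging rounds; in the actual setting, the lookahead evaluation is the piece meant to offset the staleness introduced by those local steps.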
Submission Number: 32