Keywords: Decentralized Training, Asynchronous Pipeline Parallelism
TL;DR: We analyze gradient staleness in an asynchronous variant of SWARM parallelism and propose a weight correction technique based on Nesterov Accelerated Gradient (NAG).
Abstract: SWARM parallelism is a framework that enhances pipeline parallelism in distributed training by incorporating fault tolerance. However, its reliance on synchronous updates introduces inefficiencies that limit throughput and scalability. We analyze these inefficiencies and propose an asynchronous modification to the framework in which nodes perform local updates and periodically average their states. Our results demonstrate that this asynchronous variant of SWARM achieves higher throughput without sacrificing model convergence.
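The abstract and TL;DR describe two ingredients: workers taking independent local update steps with periodic averaging of their states, and a NAG-style correction intended to compensate for gradient staleness. The toy simulation below is a minimal sketch of that idea under our own assumptions, not the paper's implementation: the function names (`train_async_swarm_sketch`, `nesterov_lookahead`), the quadratic toy objective, and all hyperparameters are illustrative placeholders.

```python
import numpy as np

def nesterov_lookahead(w, m, mu):
    # Evaluate the gradient at the lookahead point w + mu * m (NAG-style).
    # The lookahead approximates where the weights will be by the time a
    # delayed gradient is applied, acting as a simple staleness correction.
    return w + mu * m

def train_async_swarm_sketch(num_workers=4, local_steps=8, rounds=20,
                             lr=0.1, mu=0.9, dim=10, seed=0):
    rng = np.random.default_rng(seed)
    target = rng.normal(size=dim)                 # toy objective: ||w - target||^2
    w = [np.zeros(dim) for _ in range(num_workers)]
    m = [np.zeros(dim) for _ in range(num_workers)]

    for _ in range(rounds):
        # Asynchronous phase: each worker performs independent local updates.
        for k in range(num_workers):
            for _ in range(local_steps):
                lookahead = nesterov_lookahead(w[k], m[k], mu)
                grad = 2.0 * (lookahead - target)  # gradient of the toy loss at the lookahead point
                m[k] = mu * m[k] - lr * grad
                w[k] = w[k] + m[k]
        # Synchronization phase: periodically average worker states.
        avg = np.mean(w, axis=0)
        w = [avg.copy() for _ in range(num_workers)]

    loss = float(np.mean((np.mean(w, axis=0) - target) ** 2))
    print(f"final loss: {loss:.6f}")

if __name__ == "__main__":
    train_async_swarm_sketch()
```

Running the sketch shows the averaged state converging on the toy objective despite workers drifting apart between averaging rounds; in the actual setting, the lookahead evaluation is the piece meant to offset the staleness introduced by those local steps.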
Submission Number: 32