Accelerating Allreduce With In-Network Reduction on Intel PIUMA

Kartik Lakhotia, Fabrizio Petrini, Rajgopal Kannan, Viktor K. Prasanna

2022 (modified: 24 Apr 2023) | IEEE Micro 2022 | Readers: Everyone

Abstract: The Intel Programmable Integrated Unified Memory Architecture (PIUMA) system maps collective operations directly into the network switches and supports pipelined embeddings for high-throughput collective computation. Using these features and PIUMA’s network topology, we develop a methodology to generate extremely low-latency embeddings for in-network Allreduce. Our analysis shows that the proposed in-network Allreduce is highly scalable, with a single-element latency of less than 1.5 μs on 16K nodes. Compared to host-based Allreduce, it exhibits 36× lower latency and 3.5× higher throughput. With deep neural network training as an example, we further demonstrate the benefits of our in-network Allreduce for end-user applications.
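
For readers unfamiliar with the collective, the following is a minimal sketch of what Allreduce computes: every node contributes a value, the values are combined up a reduction tree, and the result is returned to all nodes. It is an illustration of the semantics only, not the embedding methodology of the paper and not PIUMA's switch behavior. The function name `tree_allreduce` and the `radix` parameter are hypothetical, and the tree here is a plain in-memory list rather than a network topology.

```python
from functools import reduce
from operator import add


def tree_allreduce(values, op=add, radix=2):
    """Illustrative Allreduce: reduce up a radix-ary tree, then broadcast.

    Returns (per_node_results, reduction_steps). Hypothetical helper,
    not an API of any real library or of PIUMA.
    """
    if not values:
        raise ValueError("need at least one node value")
    if radix < 2:
        raise ValueError("radix must be at least 2")

    level = list(values)
    steps = 0
    # Each pass combines groups of up to `radix` partial results,
    # which models one level of the reduction tree.
    while len(level) > 1:
        level = [
            reduce(op, level[i:i + radix])
            for i in range(0, len(level), radix)
        ]
        steps += 1

    total = level[0]
    # Broadcast phase: every node receives the same reduced value.
    return [total] * len(values), steps


if __name__ == "__main__":
    results, steps = tree_allreduce([1, 2, 3, 4, 5])
    print(results, steps)
```

In this sketch the number of reduction levels grows logarithmically with the node count, so a larger `radix` gives a shallower tree; this is a general property of tree reductions and makes no claim about the latency figures reported in the abstract.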
