Abstract: Modern supercomputers are becoming increasingly dense with accelerators. Industry leaders offer multi-GPU architectures with high-bandwidth interconnects between devices to match the requirements of modern workloads. While these technologies advance, it remains up to the programmer to exploit them effectively. Recognizing this burden, several abstractions have been built. We focus on the NVIDIA Collective Communication Library (NCCL) and Unified Memory (UM). The former provides MPI-like directives integrated within the GPU runtime, achieving lower latency and higher bandwidth than previous approaches. The latter simplifies the programming paradigm by offering a unified virtual address space. Moreover, it enables memory oversubscription, drastically reducing the effort required to handle larger problems without restructuring the codebase. This work provides the first joint analysis of NCCL and UM, spanning single-node multi-GPU architectures to a production supercomputer. We evaluate all available collective communication directives with respect to their power requirements and overall throughput. Moreover, we study the effects of various hyperparameters, e.g., message size, oversubscription level, and memory advice, on the overall obtainable performance. Our findings show that using UM incurs negligible additional energy consumption; moreover, in distributed settings, other limiting factors, such as network bottlenecks, outweigh the overhead introduced by UM’s page-eviction mechanisms.