Abstract: RDMA datacenters are proliferating to meet the demands of emerging workloads such as AI training and inference, as well as distributed storage. This trend has opened up a critical knowledge gap: the traffic characteristics of congestion in these networks remain unknown. We do not know, for example, which layers of the network are the most congested, whether the network is load balanced effectively, how long congestion events last, or how accurately existing telemetry systems capture congestion. This paper bridges this gap by investigating congestion in a large-scale RDMA datacenter dedicated to distributed AI training. We provide insights into three specific congestion patterns: (a) location and distribution in the network, (b) burstiness, e.g., the duration and synchrony of bursts, and (c) observability using existing telemetry methods. We show, for instance, that the deployment of Priority Flow Control (PFC) in RDMA networks has shifted the location of congestion one level up: from the edge-host in legacy TCP/IP datacenters to the network core in RDMA datacenters. At the same time, we show that the same protocol enables us to observe and understand congestion better, including bursty events. The findings of this research reveal open challenges for measuring, characterizing, and managing congestion in RDMA networks, paving the way for future research.
External IDs: doi:10.1145/3730567.3764494