Data-Driven Rate Control for RDMA Networks: A Lightweight Online Learning Approach

Published: 01 Jan 2023, Last Modified: 31 Jan 2025, Venue: ICDCS 2023, License: CC BY-SA 4.0
Abstract: Link speeds in datacenter networks (DCNs) keep growing rapidly, so an increasingly large portion of network flows become short flows that can finish within one round-trip time (RTT). This trend makes many existing congestion control schemes ineffective, because they adjust the sending rate iteratively over multiple rounds based on the latest congestion feedback. We find that DCQCN, the representative scheme for RDMA, suffers substantial performance degradation when there are many short flows, and this is especially true in High Performance Computing (HPC) scenarios, where most Message Passing Interface (MPI) messages are small. In this paper, we propose a data-driven rate control framework that learns from long-term online data about past rate control decisions via Multi-Armed Bandit (MAB), a lightweight online learning technique with a provable performance guarantee. Using this framework, we devise a rate control scheme named Dolce-RC, which dynamically controls rate increase and reduction by learning from online data. We implement Dolce-RC in commodity smart NICs and show, via testbed experiments and large-scale simulations, that compared to DCQCN, Dolce-RC reduces the average completion time of MPI messages by up to 68% without requiring any modification to switches.
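To make the MAB idea concrete, the following is a minimal sketch of a bandit choosing among candidate rate adjustments. It is not Dolce-RC itself: the abstract does not say which bandit algorithm, arms, or reward signal the scheme uses. UCB1 is swapped in here as one well-known MAB algorithm with a provable regret bound, and the arm values (RATE_FACTORS) and the reward function (toy_reward) are hypothetical stand-ins for real congestion feedback.

```python
import math


class UCB1:
    """UCB1 bandit: pick the arm with the best mean reward plus exploration bonus."""

    def __init__(self, n_arms):
        self.counts = [0] * n_arms   # times each arm was played
        self.means = [0.0] * n_arms  # running mean reward per arm

    def select(self):
        # Play every arm once before trusting the confidence bounds.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(2.0 * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental mean update; rewards are assumed to lie in [0, 1].
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical arms: multiplicative factors applied to the current sending rate.
RATE_FACTORS = [0.5, 0.75, 1.0, 1.25]


def toy_reward(factor, capacity_factor=1.0):
    """Hypothetical reward: link utilization if within capacity, else a congestion penalty."""
    if factor <= capacity_factor:
        return factor / capacity_factor
    return 0.2


def run(rounds=1000):
    bandit = UCB1(len(RATE_FACTORS))
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, toy_reward(RATE_FACTORS[arm]))
    return bandit


if __name__ == "__main__":
    result = run()
    best = max(range(len(RATE_FACTORS)), key=result.counts.__getitem__)
    print("most played factor:", RATE_FACTORS[best], "counts:", result.counts)
```

In this toy setting the bandit concentrates its plays on the factor that fills the link without overshooting it. A real deployment would derive the reward from measured feedback rather than from a fixed function.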