D-DOSA: DPU-Based Dataflow Offloading and Sparse Allreduce Framework for Distributed Training

Zhenqi Yu, Wenjing Li, Shaoyong Guo, Qingfeng Li, Feng Qi, Jiapeng Xiu

Published: 01 Jan 2026, Last Modified: 13 Mar 2026. IEEE Transactions on Cloud Computing. CC BY-SA 4.0
Abstract: Communication overhead is a primary bottleneck in distributed deep learning and impedes training scalability. Although existing gradient sparsification techniques reduce network traffic, they introduce two critical limitations: they fail to optimize intra-node data paths, and they are incompatible with efficient, decentralized Allreduce operations. To address these issues, we propose D-DOSA, a DPU-based communication offloading framework. D-DOSA incorporates two key innovations: 1) D-DO, an architecture that establishes a direct GPU-DPU data path to offload data loading and intra-node communication from the host CPU; and 2) D-SA, a novel sparse Allreduce algorithm that, for the first time, makes sparse tensors compatible with high-performance, ring-based communication. We evaluated D-DOSA on an 8-node, DPU-enabled cluster using representative models, including VGG, LSTM, and BERT. Experimental results demonstrate that our framework accelerates training by up to 1.32x over the state-of-the-art sparse training baseline without compromising accuracy. Ultimately, D-DOSA shows that co-designing dataflow architectures and communication algorithms on the DPU resolves key bottlenecks in sparse training and offers a viable path toward scalable performance in larger systems.
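The abstract does not spell out D-SA's mechanics, but the core difficulty it names is well defined: ring Allreduce assumes fixed-size dense chunks, while sparsified gradients are variable-size (index, value) sets. The sketch below is a minimal single-process simulation of one plausible reconciliation, in which each rank forwards sparse blocks around the ring and merges them by index. The top-k sparsifier, the ring schedule, and all function names (topk_sparsify, ring_sparse_allreduce) are illustrative assumptions, not the paper's published algorithm.

```python
# Hedged sketch: a ring-style allreduce over sparse (index, value) blocks,
# simulated in one process. This is NOT D-SA itself; it only illustrates
# how sparse tensors can ride a ring schedule at all.
import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries as (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def ring_sparse_allreduce(blocks):
    """blocks: one (indices, values) pair per rank.

    Each rank forwards the block it just received to its right-hand
    neighbour; after P-1 hops every rank has seen every block and
    holds the same accumulated sparse sum (as an index -> value dict).
    """
    p = len(blocks)
    # Each rank starts its accumulator from its own sparse block.
    acc = [dict(zip(idx.tolist(), val.tolist())) for idx, val in blocks]
    in_flight = list(blocks)  # the block each rank will send next
    for _ in range(p - 1):
        # Rank r receives whatever rank (r-1) mod p sent this hop.
        received = [in_flight[(r - 1) % p] for r in range(p)]
        for r, (idx, val) in enumerate(received):
            for i, v in zip(idx.tolist(), val.tolist()):
                acc[r][i] = acc[r].get(i, 0.0) + v
        in_flight = received  # forward what was just received
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grads = [rng.standard_normal(1000) for _ in range(4)]
    sparse = [topk_sparsify(g, k=10) for g in grads]
    result = ring_sparse_allreduce(sparse)
    # Check against a dense reference sum of all sparsified gradients.
    dense = np.zeros(1000)
    for idx, val in sparse:
        dense[idx] += val
    for rank_acc in result:
        assert all(np.isclose(dense[i], v) for i, v in rank_acc.items())
```

Note the design tension this sketch exposes: because sparse blocks cannot be reduced chunk-by-chunk like dense segments, each hop carries whole (index, value) blocks, so per-step message size depends on sparsity rather than being fixed at 1/P of the tensor. Resolving that mismatch efficiently is precisely the contribution the abstract claims for D-SA.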