Communication Optimization of Iterative Sparse Matrix-Vector Multiply on GPUs and FPGAs

Abid Rafique, George A. Constantinides, Nachiket Kapre

Published: 2015, Last Modified: 12 May 2025. IEEE Trans. Parallel Distributed Syst. 2015. License: CC BY-SA 4.0

Abstract: Trading communication for redundant computation can increase the silicon efficiency of FPGAs and GPUs when accelerating communication-bound sparse iterative solvers. While $k$ iterations of the iterative solver can be unrolled to provide an $O(k)$ reduction in communication cost, the extent of this unrolling depends on the underlying architecture, its memory model, and the growth in redundant computation. This paper presents a systematic procedure for selecting this algorithmic parameter $k$, which controls the communication-computation tradeoff on hardware accelerators such as FPGAs and GPUs. We provide predictive models to understand this tradeoff and show how careful selection of $k$ can lead to performance improvements that would otherwise demand a significant increase in memory bandwidth. On an Nvidia C2050 GPU, we demonstrate a 1.9$\times$-42.6$\times$ speedup over standard iterative solvers for a range of benchmarks and show that this speedup is limited by the growth in redundant computation. In contrast, for FPGAs, we present an architecture-aware algorithm that limits off-chip communication but allows communication between the processing cores. This reduces redundant computation and allows larger $k$ and hence higher speedups. Across a range of benchmarks, our FPGA approach provides a 0.3$\times$-4.4$\times$ speedup over same-generation GPU devices when $k$ is picked carefully for both architectures.
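The tradeoff described in the abstract can be illustrated with a toy example. The sketch below is not the authors' algorithm or either of their architectures; it assumes a 1D tridiagonal matrix with constant bands, and the names `spmv_tridiag`, `standard_powers`, and `blocked_powers` are hypothetical. It computes $A^k x$ block by block, reading each block once together with a halo of width $k$, so that slow-memory traffic drops from roughly $k \cdot n$ words to roughly $n$ words at the price of redundant work in the halos.

```python
def spmv_tridiag(x, lo=1.0, di=-2.0, up=1.0):
    """One product y = A x for a tridiagonal A with constant bands."""
    n = len(x)
    y = []
    for i in range(n):
        left = lo * x[i - 1] if i > 0 else 0.0
        right = up * x[i + 1] if i < n - 1 else 0.0
        y.append(left + di * x[i] + right)
    return y


def standard_powers(x, k):
    """Reference: k separate SpMV sweeps, each one streaming the whole vector."""
    for _ in range(k):
        x = spmv_tridiag(x)
    return x


def blocked_powers(x, k, block):
    """Compute A^k x one block at a time, with a single read per block.

    Each block is loaded with a halo of k entries per side. Entries near an
    artificial edge become wrong as the steps proceed, but after k steps the
    error has travelled only k entries, so the block interior stays correct.
    Returns (result, words_read, rows_computed).
    """
    n = len(x)
    out = [0.0] * n
    words_read = 0
    rows_computed = 0
    for start in range(0, n, block):
        end = min(start + block, n)
        lo_idx = max(0, start - k)
        hi_idx = min(n, end + k)
        local = x[lo_idx:hi_idx]          # the only read from "slow" memory
        words_read += len(local)
        for _ in range(k):
            local = spmv_tridiag(local)   # includes redundant halo rows
            rows_computed += len(local)
        offset = start - lo_idx
        out[start:end] = local[offset:offset + (end - start)]
    return out, words_read, rows_computed


if __name__ == "__main__":
    n, k, block = 64, 4, 16
    x = [float(i % 7) for i in range(n)]
    out, words, rows = blocked_powers(x, k, block)
    print("matches reference:", out == standard_powers(x, k))
    print("words read:", words, "vs", k * n, "for k separate sweeps")
    print("rows computed:", rows, "vs", k * n, "without redundancy")
```

In this toy setting the halo width, and with it the redundant work per block, grows linearly with $k$. For general sparse matrices the growth depends on the sparsity pattern, which is one reason the best $k$ is architecture- and problem-dependent. For simplicity the sketch recomputes the full halo at every step rather than shrinking it.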