Abstract: Sparse matrix-vector multiplication (SpMV) is one of the key operations in linear algebra. Overcoming thread divergence, load imbalance and non-coalesced and indirect memory access due to sparsity and irregularity are challenges to optimizing SpMV on GPUs.In this paper we present a new blocked row-column (BRC) storage format with a novel two-dimensional blocking mechanism that effectively addresses the challenges: it reduces thread divergence by reordering and grouping rows of the input matrix with nearly equal number of non-zero elements onto the same execution units (i.e., warps). BRC improves load balance by partitioning rows into blocks with a constant number of non-zeros such that different warps perform the same amount of work. We also present an efficient auto-tuning technique to optimize BRC performance by judicious selection of block size based on sparsity characteristics of the matrix. A CUDA implementation of BRC outperforms NVIDIA CUSP and cuSPARSE libraries and other state-of-the-art SpMV formats on a range of unstructured sparse matrices from multiple application domains. The BRC format has been integrated with PETSc, enabling its use in PETSc's solvers.
Loading