Memory Hierarchy Optimizations and Performance Bounds for Sparse A^T A x

Rich Vuduc, Attila Gyulassy, James Demmel, Katherine A. Yelick

Published: 2003, Last Modified: 12 May 2023, International Conference on Computational Science 2003, Readers: Everyone

Abstract: This paper presents uniprocessor performance optimizations, automatic tuning techniques, and an experimental analysis of the sparse matrix operation y = A^T A x, where A is a sparse matrix and x, y are dense vectors. We describe an implementation of this computational kernel which brings A through the memory hierarchy only once, and which can be combined naturally with the register blocking optimization previously proposed in the Sparsity tuning system for sparse matrix-vector multiply. We evaluate these optimizations on a benchmark set of 44 matrices and 4 platforms, showing speedups of up to 4.2×. We also develop platform-specific upper bounds on the performance of these implementations. We analyze how closely we can approach these bounds, and show when low-level tuning techniques (e.g., better instruction scheduling) are likely to yield a significant pay-off. Finally, we propose a hybrid off-line/run-time heuristic which in practice automatically selects near-optimal values of the key tuning parameters, the register block sizes.
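
To illustrate what bringing A through the memory hierarchy only once can look like, here is a minimal Python sketch. It is not the authors' implementation. It assumes A is stored in compressed sparse row (CSR) form, omits register blocking, and uses hypothetical names (`ata_x_csr`, `row_ptr`, `col_idx`, `vals`). It relies on the identity A^T A x = sum over rows a_i of (a_i · x) a_i, so each row is read for the dot product and then reused immediately for the update while it is still likely to be in cache.

```python
def ata_x_csr(n_cols, row_ptr, col_idx, vals, x):
    """Compute y = A^T (A x) for a CSR matrix A in a single sweep over its rows.

    Hypothetical helper for illustration only:
      n_cols  : number of columns of A (length of x and y)
      row_ptr : CSR row pointers, length n_rows + 1
      col_idx : column index of each stored nonzero
      vals    : value of each stored nonzero
    """
    y = [0.0] * n_cols
    for i in range(len(row_ptr) - 1):
        start, end = row_ptr[i], row_ptr[i + 1]

        # Step 1: t = a_i . x  (dot product of row i with x)
        t = 0.0
        for k in range(start, end):
            t += vals[k] * x[col_idx[k]]

        # Step 2: y += t * a_i  (reuse the same row right away)
        for k in range(start, end):
            y[col_idx[k]] += t * vals[k]
    return y


# Example: A = [[1, 0, 2],
#               [0, 3, 0]],  x = [1, 1, 1]
row_ptr = [0, 2, 3]
col_idx = [0, 2, 1]
vals = [1.0, 2.0, 3.0]
y = ata_x_csr(3, row_ptr, col_idx, vals, [1.0, 1.0, 1.0])
print(y)  # [3.0, 9.0, 6.0]
```

A straightforward alternative computes t = A x in one pass and y = A^T t in a second pass, which reads A from memory twice when A does not fit in cache. The row-at-a-time form above avoids the second read and also never stores the full intermediate vector t.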
