Stream-K: Work-Centric Parallel Decomposition for Dense Matrix-Matrix Multiplication on the GPU

Muhammad Osama, Duane Merrill, Cris Cecka, Michael Garland, John D. Owens

Published: 2023, Last Modified: 06 Nov 2024, PPoPP 2023, CC BY-SA 4.0

Abstract: We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method assigns each physical processing element an even share of the aggregate inner-loop iterations. This provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
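
The contrast drawn in the abstract can be illustrated with a small host-side sketch. It is not the authors' implementation: the function names, the blocking factors, and the processing-element count are all hypothetical, and it only counts work rather than performing any multiplication. It compares the utilization of a tile-per-processing-element schedule, which proceeds in whole waves of tiles, with an even split of the flattened inner-loop iteration space.

```python
import math


def tile_counts(m, n, k, blk_m, blk_n, blk_k):
    """Return (number of output tiles, inner-loop iterations per tile)."""
    tiles = math.ceil(m / blk_m) * math.ceil(n / blk_n)
    iters_per_tile = math.ceil(k / blk_k)
    return tiles, iters_per_tile


def tile_based_utilization(tiles, num_pes):
    """Utilization when each processing element (PE) computes whole tiles.

    Tiles are processed in waves of num_pes; a partially filled last
    wave leaves some PEs idle (the quantization inefficiency).
    """
    waves = math.ceil(tiles / num_pes)
    return tiles / (waves * num_pes)


def even_iteration_ranges(tiles, iters_per_tile, num_pes):
    """Split the flattened iteration space [0, tiles * iters_per_tile)
    into num_pes contiguous ranges whose sizes differ by at most one.

    Illustrative assumption: iteration i belongs to tile i // iters_per_tile.
    """
    total = tiles * iters_per_tile
    base, extra = divmod(total, num_pes)
    ranges = []
    start = 0
    for pe in range(num_pes):
        count = base + (1 if pe < extra else 0)
        ranges.append((start, start + count))
        start += count
    return ranges


def range_utilization(ranges):
    """Utilization of an iteration split: total work / (max load * PEs)."""
    loads = [end - start for start, end in ranges]
    busiest = max(loads)
    return sum(loads) / (busiest * len(loads)) if busiest else 1.0


if __name__ == "__main__":
    # Hypothetical problem: 384 x 384 x 4096 GEMM, 128 x 128 x 32 blocking, 4 PEs.
    tiles, iters = tile_counts(384, 384, 4096, 128, 128, 32)
    print(tiles, iters)
    print(tile_based_utilization(tiles, 4))
    ranges = even_iteration_ranges(tiles, iters, 4)
    print(ranges)
    print(range_utilization(ranges))
```

In this hypothetical case, nine tiles on four processing elements need three waves, so the tile-based schedule keeps the machine 75% busy, while the even split gives every processing element 288 of the 1152 iterations. The sketch omits a step any real scheme of this kind needs: a range that begins or ends in the middle of a tile yields only a partial sum for that tile, which must be combined with the contributions of the other processing elements that share it.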