TL;DR: Improves GPU matrix multiplication algorithms for Kronecker-sparse matrices
Abstract: Kronecker-sparse (KS) matrices—whose supports are Kronecker products of identity and all-ones blocks—underpin the structure of Butterfly and Monarch matrices and offer the promise of more efficient models. However, existing GPU kernels for KS matrix multiplication suffer from high data movement costs, with up to 50% of their time spent on memory-bound tensor permutations. We propose a fused, output-stationary GPU kernel that eliminates these overheads, reducing global memory traffic threefold. Across 600 KS patterns, our kernel achieves a median FP32 speedup of ×1.4 and lowers energy consumption by 15%. A simple heuristic based on KS pattern parameters predicts when our method outperforms existing ones. We release all code at [github.com/PascalCarrivain/ksmm](https://github.com/PascalCarrivain/ksmm), including a PyTorch-compatible *KSLinear* layer, and demonstrate FP32 end-to-end latency reductions of up to 22% in ViT-S/16 and 16% in GPT-2 medium.
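For readers unfamiliar with these supports, here is a minimal NumPy sketch of the structure described in the abstract: a Kronecker product of identity blocks and an all-ones block. The specific `(a, b, c, d)` parametrization and the `ks_support` helper below are illustrative assumptions introduced here, not the released KSLinear API.

```python
import numpy as np

def ks_support(a: int, b: int, c: int, d: int) -> np.ndarray:
    """Binary support of an assumed Kronecker-sparse pattern (a, b, c, d).

    Illustrative sketch only: the support is taken to be
        I_a  (kron)  1_{b x c}  (kron)  I_d,
    i.e. a Kronecker product of identity and all-ones blocks,
    of shape (a*b*d, a*c*d).
    """
    return np.kron(np.kron(np.eye(a), np.ones((b, c))), np.eye(d))

# Example: a matrix constrained to this support multiplies an input batch.
a, b, c, d = 2, 3, 4, 2
support = ks_support(a, b, c, d)                      # shape (12, 16)
weights = np.random.randn(*support.shape) * support   # zeros outside support
X = np.random.randn(a * c * d, 8)                     # (in_features, batch)
Y = weights @ X                                       # dense reference product
print(support.shape, Y.shape)
```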
Lay Summary: Today’s AI models are costly to run because they spend most of their time and energy multiplying huge tables of numbers. One way to speed this up is to use tables with lots of zeros, since multiplying by zero does nothing and can be skipped. If we know where the zeros are in advance, we can organize the work to skip them efficiently—but how we skip them matters.
In this work, we focus on a promising zero pattern that has been widely studied. We show that, even with this pattern, the fastest current Graphics Processing Unit (GPU) methods waste up to half their time just shuffling numbers around before doing the real calculations. While this shuffling is meant to help (to reorganize the table into big chunks that can be skipped), our work reveals that it is unnecessary, and we propose a faster alternative.
By keeping the table data in place and changing how the GPU organizes the work, we skip the zeros without extra shuffling. Our method typically runs about 40% faster and uses 15% less energy across 600 test cases. We provide open-source code for both GPUs and CPUs, plus a one-line rule of thumb to tell when our approach helps. Plugged into large AI models like Transformers, it cuts total running time by up to 22%, paving the way for faster, cheaper AI.
Link To Code: https://github.com/PascalCarrivain/ksmm
Primary Area: Deep Learning->Algorithms
Keywords: GPU matrix multiplication, Kronecker-sparse
Submission Number: 4786