Unlocking the Full Potential of Separable Convolutions on Tensor Cores

Published: 01 Jan 2025 · Last Modified: 04 Nov 2025 · ICIC (16) 2025 · CC BY-SA 4.0
Abstract: While separable convolutions have demonstrated great performance in network design, they suffer from poor efficiency on Tensor Core-equipped GPUs. This paper proposes TensorFuse, which exploits Tensor Cores by transforming nested loops into hierarchical matrix multiplications for kernel fusion. TensorFuse minimizes redundant memory accesses by efficiently lowering GEMM-based convolution along the execution hierarchy, from shared memory down to register files. Compared with the state of the art, it achieves up to 2.60× inference speedup on Tensor Cores. Furthermore, we explore the performance of network decoupling with multiple separable convolutions. TensorFuse consistently outperforms state-of-the-art libraries with up to 2.76× acceleration on modern CNN benchmarks.
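For readers unfamiliar with the structure the abstract refers to, a separable convolution factors a standard convolution into a per-channel (depthwise) stage followed by a 1×1 (pointwise) stage; the pointwise stage is a plain GEMM over the channel dimension, which is what Tensor Cores accelerate. The sketch below is a generic NumPy reference, not TensorFuse's implementation: all function names and shapes are illustrative assumptions.

```python
import numpy as np

def depthwise_conv(x, w):
    """x: (C, H, W) input, w: (C, k, k) one filter per channel; 'valid' padding."""
    C, H, W = x.shape
    _, k, _ = w.shape
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                       # each channel convolved independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * w[c])
    return out

def pointwise_conv(x, w):
    """x: (C, H, W), w: (C_out, C). A 1x1 convolution is a matrix multiply
    over channels -- the GEMM shape that maps directly onto Tensor Cores."""
    C, H, W = x.shape
    return (w @ x.reshape(C, H * W)).reshape(w.shape[0], H, W)

def separable_conv(x, w_dw, w_pw):
    """Unfused reference: the depthwise result round-trips through memory
    before the pointwise GEMM reads it back. Kernel fusion (as the paper
    proposes) keeps that intermediate in shared memory / registers instead."""
    return pointwise_conv(depthwise_conv(x, w_dw), w_pw)
```

The unfused composition above makes the cost the paper targets visible: the intermediate `(C, H-k+1, W-k+1)` tensor is written out and read back between the two stages, which is exactly the redundant memory traffic that fusing the two loops into one hierarchical GEMM avoids.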