Unlocking the Full Potential of Separable Convolutions on Tensor Cores

Published: 01 Jan 2025 · Last Modified: 04 Nov 2025 · ICIC (16) 2025 · CC BY-SA 4.0
Abstract: While separable convolutions have demonstrated great performance in network design, they suffer from poor efficiency on Tensor Core-equipped GPUs. This paper proposes TensorFuse, which exploits Tensor Cores by transforming nested loops into hierarchical matrix multiplications for kernel fusion. TensorFuse minimizes redundant memory accesses by efficiently lowering GEMM-based convolution along the execution hierarchy, from shared memory down to register files. Compared with the state of the art, it achieves up to 2.60× inference speedup on Tensor Cores. Furthermore, we explore the performance of network decoupling with multiple separable convolutions. TensorFuse consistently outperforms state-of-the-art libraries with up to 2.76× acceleration on modern CNN benchmarks.
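For readers unfamiliar with the structure the abstract refers to, a separable convolution factors a standard convolution into a per-channel (depthwise) stage followed by a 1×1 (pointwise) stage; the pointwise stage is a plain GEMM over the channel dimension, which is what Tensor Cores accelerate. The sketch below is a generic NumPy reference, not TensorFuse's implementation: all function names and shapes are illustrative assumptions.

```python
import numpy as np

def depthwise_conv(x, w):
    """x: (C, H, W) input, w: (C, k, k) one filter per channel; 'valid' padding."""
    C, H, W = x.shape
    _, k, _ = w.shape
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                       # each channel convolved independently
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * w[c])
    return out

def pointwise_conv(x, w):
    """x: (C, H, W), w: (C_out, C). A 1x1 convolution is a matrix multiply
    over channels -- the GEMM shape that maps directly onto Tensor Cores."""
    C, H, W = x.shape
    return (w @ x.reshape(C, H * W)).reshape(w.shape[0], H, W)

def separable_conv(x, w_dw, w_pw):
    """Unfused reference: the depthwise result round-trips through memory
    before the pointwise GEMM reads it back. Kernel fusion (as the paper
    proposes) keeps that intermediate in shared memory / registers instead."""
    return pointwise_conv(depthwise_conv(x, w_dw), w_pw)
```

The unfused composition above makes the cost the paper targets visible: the intermediate `(C, H-k+1, W-k+1)` tensor is written out and read back between the two stages, which is exactly the redundant memory traffic that fusing the two loops into one hierarchical GEMM avoids.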