SRConvNet: A Transformer-Style ConvNet for Lightweight Image Super-Resolution

Published: 01 Jan 2025 · Last Modified: 28 Apr 2025 · Int. J. Comput. Vis. 2025 · CC BY-SA 4.0
Abstract: Vision transformers have recently demonstrated their superiority over convolutional neural networks (ConvNets) in various tasks, including single-image super-resolution (SISR). Much of this success is attributed to the multi-head self-attention (MHSA) mechanism, which models global connectivity effectively with few parameters. However, the quadratic complexity of MHSA incurs substantial computational cost and memory consumption, limiting efficient deployment on mobile devices compared with widely used lightweight ConvNets. In this work, we thoroughly examine the key differences between ConvNet- and transformer-based SR models and present SRConvNet, which combines the merits of both for lightweight SISR. SRConvNet rests on two primary designs: (1) Fourier modulated attention (FMA), an MHSA-like but more computationally and parametrically efficient operator that performs regional frequency-spatial modulation and aggregation to capture both long- and short-range dependencies; and (2) a dynamic mixing layer (DML) that applies mixed-scale depthwise dynamic convolution with channel splitting and shuffling to extract multi-scale contextual information, enhancing the model's locality and adaptability. Combining FMA and DML yields a pure transformer-style ConvNet that competes with the best lightweight SISR models in the trade-off between efficiency and accuracy. Extensive experiments demonstrate that SRConvNet achieves more efficient SR reconstruction than recent state-of-the-art lightweight SISR methods, in both computation and parameter count, while preserving comparable performance. Code is available at https://github.com/lifengcs/SRConvNet.
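For intuition, the following is a minimal sketch of a Fourier-domain modulation operator in the spirit of FMA: features are transformed with a real FFT, each frequency is scaled by a learnable complex weight (global mixing at O(N log N) rather than MHSA's quadratic cost), and the result is transformed back. The class name, the fixed spatial size, and the exact modulation scheme are illustrative assumptions, not the paper's implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn

class FourierModulation(nn.Module):
    """Sketch of a Fourier-domain modulation operator (not the official FMA).

    Applies a real 2D FFT over the spatial dimensions, scales every
    frequency/channel with a learnable complex weight, then maps back
    to the spatial domain. Each output location thus depends on all
    input locations, giving global mixing without quadratic attention.
    """

    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        # Learnable per-frequency complex weights, stored as (real, imag)
        # pairs; shape matches the rfft2 output (C, H, W//2 + 1).
        self.weight = nn.Parameter(torch.randn(channels, h, w // 2 + 1, 2) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        x_freq = torch.fft.rfft2(x, norm="ortho")            # complex (B, C, H, W//2+1)
        x_freq = x_freq * torch.view_as_complex(self.weight)  # frequency modulation
        return torch.fft.irfft2(x_freq, s=(h, w), norm="ortho")
```

Note that this simplified sketch ties the learnable weights to a fixed spatial size, whereas the paper's operator is described as regional and resolution-agnostic.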
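Likewise, a hedged sketch of the mixed-scale depthwise mixing idea behind DML: channels are split into groups, each group is processed by a depthwise convolution with a different kernel size, and the outputs are channel-shuffled so subsequent layers mix information across scales. The dynamic (input-conditioned) kernel generation described in the abstract is omitted here for brevity; the function and class names and the kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Interleave channels across groups (as in ShuffleNet)."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class MixedScaleDepthwise(nn.Module):
    """Sketch of mixed-scale depthwise mixing with channel split/shuffle.

    Static depthwise kernels stand in for the paper's dynamic ones;
    only the multi-scale split-and-shuffle structure is illustrated.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        assert channels % len(kernel_sizes) == 0
        self.split = channels // len(kernel_sizes)
        # One depthwise branch per kernel size, each seeing one channel group.
        self.branches = nn.ModuleList(
            nn.Conv2d(self.split, self.split, k, padding=k // 2, groups=self.split)
            for k in kernel_sizes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = torch.split(x, self.split, dim=1)             # channel splitting
        out = torch.cat([b(c) for b, c in zip(self.branches, chunks)], dim=1)
        return channel_shuffle(out, groups=len(self.branches))  # mix scales
```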