Token-Aligned Hierarchies for Lightweight Super-Resolution

Published: 27 Apr 2026, Last Modified: 27 Apr 2026 · EDGE Poster · CC BY 4.0
Keywords: Single-Image Super-Resolution, Hierarchical Convolutional Architecture, Efficient
Paper Track: Long Paper (archival)
TL;DR: Swin-HIER is a hierarchical, token-aligned super-resolution model that replaces windowed self-attention with lightweight depthwise and pointwise convolutions, delivering transformer-level quality with much lower latency and memory use.
Abstract: Windowed self-attention (WSA) has become a strong backbone for single-image super-resolution (SR), yet its high computational overhead leads to poor latency. We revisit Swin-style SR from a hierarchical perspective and introduce a token-aligned encoder–decoder built entirely from grouped and depthwise convolutions, replacing attention windows with efficient spatial mixing. Our architecture preserves the locality bias of WSA while substantially improving speed and stability. It incorporates (i) symmetry-preserving padding for consistent token partitioning, (ii) a token pyramid that expands channels through patch merging to aggregate broader context, and (iii) Token-Aligned Skip Fusion (TASF) for precise multi-scale feature reuse. Built upon the SwinIR hierarchy, our model attains high reconstruction quality (PSNR $\approx$ 37.8 dB for $\times2$) together with the lowest latency among all compared methods, including faster inference than SwinIR-light, while maintaining strong texture consistency and low memory usage. These results demonstrate that hierarchical, convolution-based modeling can match or surpass transformer performance at a fraction of the cost, making our design well suited to real-time and edge SR applications.
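The core substitution the abstract describes is replacing windowed self-attention with a depthwise convolution for spatial token mixing followed by a pointwise (1×1) convolution for channel mixing. The sketch below illustrates that pattern in NumPy; the function names, kernel sizes, and tensor shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def depthwise_conv(x, k):
    """Spatial mixing: each channel is filtered independently with its own
    kernel ('same' padding), standing in for a local attention window.
    x: (C, H, W) feature map, k: (C, kh, kw) per-channel kernels."""
    C, H, W = x.shape
    kh, kw = k.shape[1:]
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw)))  # zero 'same' padding
    out = np.zeros_like(x)
    for c in range(C):
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(xp[c, i:i + kh, j:j + kw] * k[c])
    return out

def pointwise_conv(x, w):
    """Channel mixing: a 1x1 convolution, i.e. a linear map over channels
    applied at every spatial position. w: (C_out, C_in)."""
    return np.einsum('oc,chw->ohw', w, x)

# Tiny usage example on a random feature map.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))          # (C, H, W)
k = rng.standard_normal((4, 3, 3))          # one 3x3 kernel per channel
w = rng.standard_normal((6, 4))             # expand 4 -> 6 channels
y = pointwise_conv(depthwise_conv(x, k), w)
print(y.shape)  # (6, 8, 8)
```

A depthwise pass costs O(C·kh·kw) multiplies per pixel versus the quadratic-in-window-size cost of attention, which is the source of the latency savings the abstract claims.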
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 7