Transforming time and space: efficient video super-resolution with hybrid attention and deformable transformers
Abstract: Space-time video super-resolution (STVSR) aims to enhance low frame rate (LFR) and low resolution (LR) videos into high frame rate (HFR) and high resolution (HR) outputs. Traditional two-stage methods decompose STVSR into video super-resolution (VSR) and video frame interpolation (VFI), incurring significant computational overhead. Designing efficient, high-performance one-stage STVSR methods therefore remains an open challenge. Transformer-based one-stage approaches have shown promise by processing frames in parallel and effectively capturing temporal dependencies, but their large model sizes hinder practical application. A key factor in optimizing such methods is the effective utilization of extracted features, since improper feature management can degrade performance. In this work, we propose a novel one-stage STVSR framework, termed DHAT, which leverages guided deformable attention (GDA) and hybrid attention mechanisms. In the feature propagation stage, we introduce a recurrent feature refinement module based on GDA that balances parallelism with recurrent processing. Additionally, we design a hybrid attention block that combines cross-attention and channel attention, enabling refined spatiotemporal feature aggregation; the cross-attention mechanism plays a pivotal role in fusing multi-scale temporal information across frames. Extensive experiments demonstrate that DHAT outperforms state-of-the-art methods on several benchmark datasets, achieving higher PSNR and SSIM scores.
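To make the hybrid attention design more concrete, the following is a minimal PyTorch sketch of a block that combines cross-attention (queries from the current frame, keys/values from a supporting frame) with squeeze-and-excitation style channel attention. All layer names, shapes, and hyperparameters here are illustrative assumptions and do not reproduce the paper's exact DHAT architecture.

```python
import torch
import torch.nn as nn


class HybridAttentionBlock(nn.Module):
    """Illustrative hybrid attention block (assumed design, not the paper's):
    cross-attention fuses current-frame features with features from a
    neighboring or interpolated frame, then channel attention re-weights
    the fused channels."""

    def __init__(self, dim: int, num_heads: int = 4, reduction: int = 16):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Cross-attention: queries from the current frame, keys/values
        # from the supporting frame.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Squeeze-and-excitation style channel attention.
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.Sigmoid(),
        )
        self.proj = nn.Linear(dim, dim)

    def forward(self, cur: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # cur, ref: (B, N, C) token sequences from the current and a
        # supporting frame, where N = H * W spatial positions.
        q = self.norm_q(cur)
        kv = self.norm_kv(ref)
        fused, _ = self.cross_attn(q, kv, kv)        # cross-frame aggregation
        fused = cur + fused                          # residual connection
        gate = self.channel_gate(fused.mean(dim=1))  # (B, C) channel weights
        fused = fused * gate.unsqueeze(1)            # channel re-weighting
        return fused + self.proj(fused)


if __name__ == "__main__":
    block = HybridAttentionBlock(dim=64)
    cur = torch.randn(2, 32 * 32, 64)   # current frame features
    ref = torch.randn(2, 32 * 32, 64)   # neighboring frame features
    print(block(cur, ref).shape)        # torch.Size([2, 1024, 64])
```

In this sketch the cross-attention step performs the cross-frame temporal fusion, while the channel gate plays the role of channel attention; a multi-scale variant would apply such blocks at several feature resolutions.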