Stream-DiffVSR: Low-Latency Streamable Video Super-Resolution via Auto-Regressive Diffusion

Hau-Shiang Shiu; Chin-Yang Lin; Zhixiang Wang; Chi-Wei Hsiao; Po-Fan Yu; Yu-Chih Chen; Yu-Lun Liu

09 Sept 2025 (modified: 14 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Video Super-Resolution, Online Video Restoration, Video restoration, Diffusion models

Abstract: Diffusion-based video super-resolution (VSR) methods have recently demonstrated remarkable perceptual quality; however, their reliance on future-frame information and computationally expensive iterative denoising has restricted their use in latency-sensitive contexts. We present Stream-DiffVSR, a causally conditioned diffusion VSR framework designed for efficient online inference. Our method operates strictly on past frames and integrates three key components: a four-step distilled denoiser, an auto-regressive temporal guidance (ARTG) module that injects motion-aligned temporal cues into the denoising process, and a lightweight temporal-aware decoder with a temporal processor module (TPM) that enhances spatial detail and temporal consistency. Stream-DiffVSR processes a 720p frame in 0.328 seconds on an RTX 4090 GPU, significantly outperforming previous diffusion-based methods. Compared with state-of-the-art online methods such as TMP, Stream-DiffVSR achieves a substantial improvement in perceptual quality (LPIPS improved by 0.095). Relative to previous diffusion-based VSR approaches, it reduces inference latency by more than 130×. Notably, Stream-DiffVSR achieves the lowest latency reported among diffusion-based VSR methods, reducing the initial delay from over 4600 seconds to 0.328 seconds, which makes it the first diffusion-based solution viable for real-time online deployment. These results demonstrate the potential of diffusion models for practical deployment in time-sensitive rendering pipelines and real-world video super-resolution systems.
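The causal, frame-by-frame structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`stream_vsr`, `align_previous`, `denoise_step`, `decode`) and the toy stand-ins are hypothetical, frames are plain lists of floats rather than image tensors, and the only details taken from the abstract are that each frame uses four denoising steps, that guidance comes from the previous output only, and that the decoder also sees the temporal cue.

```python
from typing import Callable, Iterable, Iterator, List, Optional

Frame = List[float]  # toy stand-in for an image or latent tensor (assumption)

NUM_STEPS = 4  # the abstract describes a four-step distilled denoiser


def stream_vsr(
    lr_frames: Iterable[Frame],
    denoise_step: Callable[[Frame, Frame, Optional[Frame], int], Frame],
    align_previous: Callable[[Frame, Frame], Frame],
    decode: Callable[[Frame, Optional[Frame]], Frame],
) -> Iterator[Frame]:
    """Yield one output per input frame, using only past information."""
    prev_hr: Optional[Frame] = None
    for lr in lr_frames:
        # Temporal guidance: align the previous output to the current frame.
        # The first frame has no history, so it runs without guidance.
        guidance = align_previous(prev_hr, lr) if prev_hr is not None else None

        latent = list(lr)  # placeholder initialisation (assumption)
        for step in range(NUM_STEPS):
            latent = denoise_step(latent, lr, guidance, step)

        hr = decode(latent, guidance)
        prev_hr = hr  # the only state carried to the next frame
        yield hr


# Toy stand-ins so the sketch runs; real modules would be neural networks.
def toy_align(prev_hr: Frame, lr: Frame) -> Frame:
    return prev_hr  # identity "warp"


def toy_denoise(latent: Frame, lr: Frame, guidance: Optional[Frame], step: int) -> Frame:
    if guidance is None:
        target = lr
    else:
        target = [(a + b) / 2 for a, b in zip(lr, guidance)]
    # Move the latent halfway towards the target at every step.
    return [x + (t - x) / 2 for x, t in zip(latent, target)]


def toy_decode(latent: Frame, guidance: Optional[Frame]) -> Frame:
    return list(latent)


if __name__ == "__main__":
    frames = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0]]
    for out in stream_vsr(frames, toy_denoise, toy_align, toy_decode):
        print(out)
```

Because the loop is a generator that keeps only the previous output as state, each result can be emitted as soon as its input frame arrives, and the output for frame t is unchanged no matter which frames follow it. That property is what the abstract refers to as operating strictly on past frames.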

Supplementary Material: zip

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 3456
