Keywords: computer vision, temporal consistency, robustness, video, denoising, image enhancement, depth estimation, semantic segmentation
TL;DR: We propose a general approach for modifying image-based models to improve robustness and temporal consistency on video.
Abstract: When applied sequentially to video, frame-based networks often exhibit temporal inconsistency, such as outputs that flicker between frames. This problem is amplified when the network inputs contain time-varying corruptions. In this work, we introduce a general approach for adapting frame-based models for stable and robust inference on video. We describe a class of stability adapters that can be inserted into virtually any architecture, together with a resource-efficient training procedure that keeps the base network frozen. We also present a unified conceptual framework for temporal stability and corruption robustness, centered on a proposed accuracy-stability-robustness loss. By analyzing the theoretical properties of this loss, we identify the conditions under which it yields well-behaved stabilizer training. Our experiments validate the approach on several vision tasks, including denoising (NAFNet), image enhancement (HDRNet), monocular depth estimation (Depth Anything v2), and semantic segmentation (DeepLabv3+). Our method improves temporal stability and robustness to a range of image corruptions (including compression artifacts, noise, and adverse weather) while preserving or improving prediction quality.
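The abstract names the main ingredients: lightweight stability adapters inserted into a frozen base network, and an accuracy-stability-robustness objective. The sketch below shows one plausible way these could fit together; the module structure, loss terms, and weights are illustrative assumptions on our part, not the authors' implementation.

```python
# Hedged sketch (not the paper's code): a small residual adapter that fuses the
# current frame's features with the previous frame's, plus an illustrative
# accuracy-stability-robustness loss. Only the adapters would be trained; the
# base network stays frozen, e.g.:
#   for p in base_model.parameters(): p.requires_grad_(False)
import torch
import torch.nn as nn
import torch.nn.functional as F


class StabilityAdapter(nn.Module):
    """Lightweight residual block inserted after a base-network stage.

    It conditions on features from the previous frame so the adapter can damp
    frame-to-frame flicker without changing the frozen base computation.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, prev_feat: torch.Tensor) -> torch.Tensor:
        fused = F.relu(self.fuse(torch.cat([feat, prev_feat], dim=1)))
        # Residual update: with small weights the adapter preserves the base output.
        return feat + self.out(fused)


def asr_loss(pred_t, pred_prev, target_t, pred_clean_t, lambda_s=0.5, lambda_r=0.5):
    """Illustrative accuracy-stability-robustness loss.

    accuracy:   match the ground-truth target for the current frame
    stability:  agree with the prediction on the previous frame
    robustness: agree with the prediction on the clean (uncorrupted) frame
    """
    accuracy = F.l1_loss(pred_t, target_t)
    stability = F.l1_loss(pred_t, pred_prev)
    robustness = F.l1_loss(pred_t, pred_clean_t.detach())
    return accuracy + lambda_s * stability + lambda_r * robustness
```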
Supplementary Material: zip
Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)
Submission Number: 4137