Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Keywords: self-supervised image pretraining
TL;DR: We improve vision encoder reliability for dynamic scenes with a simple self-supervised task: predicting the next video frame's dense features to instill physical and temporal priors.
Abstract: Foundation models deployed in dynamic domains like robotics and autonomous systems suffer from critical reliability failures, including temporal inconsistencies and vulnerability to sensor noise, stemming from their training on static, disconnected images. To bridge this reliability gap, we propose a lightweight, reliability-aware training paradigm that distills temporal knowledge from video into a standard single-image encoder. By training a predictor to estimate the feature representation of a future frame, our method implicitly forces the backbone to learn real-world dynamics, enhancing robustness to transient visual artifacts and promoting temporally stable representations. This self-supervised objective instills geometric and physical priors without relying on brittle external modules such as optical flow estimators. Remarkably, when pre-trained on only a single 2-hour uncurated video, our method achieves state-of-the-art performance among DINO-style approaches on downstream tasks such as detection and segmentation, which we use as quantifiable proxies for robust scene understanding. Our work presents a practical and efficient approach for improving the trustworthiness and dependability of vision encoders for safe deployment in operational settings.
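To make the objective described in the abstract concrete, the following is a minimal, illustrative sketch of a next-frame dense-feature prediction loss. It is an assumption-laden reconstruction, not the paper's implementation: the `FeaturePredictor` head, the EMA-style frozen teacher target, and the per-patch cosine loss are all hypothetical choices introduced here for illustration.

```python
# Illustrative sketch only (not the authors' code): predict frame t+1's
# dense features from frame t's encoder features with a lightweight head.
# The predictor design, frozen teacher target, and cosine loss are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePredictor(nn.Module):
    """Hypothetical head mapping frame-t patch features to predicted frame-t+1 features."""

    def __init__(self, dim: int, hidden: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) dense features of frame t
        return self.net(feats)


def next_frame_feature_loss(student, teacher, predictor, frame_t, frame_t1):
    """Self-supervised loss: predict the next frame's dense features.

    `student` encodes frame t; `teacher` (assumed here to be a frozen or EMA
    copy of the student) provides frame t+1 targets and receives no gradient.
    """
    pred = predictor(student(frame_t))      # (B, P, D) predicted features
    with torch.no_grad():
        target = teacher(frame_t1)          # (B, P, D) target features
    # Negative per-patch cosine similarity, averaged (an assumed loss choice).
    return -F.cosine_similarity(pred, target, dim=-1).mean()


if __name__ == "__main__":
    # Stand-in backbone: any encoder mapping images to (B, P, D) patch features.
    class ToyEncoder(nn.Module):
        def __init__(self, dim: int = 64, patches: int = 196):
            super().__init__()
            self.proj = nn.Linear(3 * 224 * 224, patches * dim)
            self.patches, self.dim = patches, dim

        def forward(self, x):
            return self.proj(x.flatten(1)).view(x.size(0), self.patches, self.dim)

    student = ToyEncoder()
    teacher = copy.deepcopy(student).eval()  # EMA update omitted for brevity
    predictor = FeaturePredictor(dim=64)

    frame_t = torch.randn(2, 3, 224, 224)
    frame_t1 = torch.randn(2, 3, 224, 224)
    loss = next_frame_feature_loss(student, teacher, predictor, frame_t, frame_t1)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

In this sketch, gradients flow only through the student encoder and the predictor, so the backbone is pushed to encode information that makes the future frame predictable; consecutive frames would come from the uncurated video mentioned in the abstract.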
Submission Track: Workshop Paper Track
Submission Number: 29