The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos

Published: 20 Aug 2025 · Last Modified: 16 Oct 2025 · SP4V · CC BY 4.0
Keywords: Vision Foundation Models, 3D, Videos
Abstract: Image-based vision foundation models (VFMs) have demonstrated surprising 3D geometric awareness, despite no explicit 3D supervision or pre-training on multi-view data. While image-based models are widely adopted across a range of downstream tasks, video-based models have so far remained on the sidelines of this success. In this work, we conduct a comparative study of image and video models on three tasks that capture 3D awareness: multi-view consistency, depth estimation, and surface normal estimation. To enable a fair and reproducible evaluation of both image and video models, we develop AnyProbe, a unified framework for probing network representations. The results of our study reveal a surprising conclusion, which we refer to as the diashow paradox: video-based pre-training provides no consistent advantage over image-based pre-training on downstream tasks involving 3D understanding. We formulate two hypotheses to explain our observations, which underscore the need for high-quality video datasets and highlight the inherent complexity of video-based pre-training. AnyProbe will be publicly released to streamline consistent evaluation of image- and video-based VFMs alike.
Submission Number: 9
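To make the probing setup described in the abstract concrete, here is a minimal sketch of how a frozen VFM might be probed for depth estimation. The AnyProbe API is not described on this page, so every name below (the backbone interface, `DepthProbe`, the feature shapes) is an illustrative assumption in generic PyTorch, not the authors' actual code.

```python
import torch
import torch.nn as nn

class DepthProbe(nn.Module):
    """Linear probe mapping frozen patch features to per-patch depth."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)  # one depth value per patch token

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, feat_dim) from a frozen image/video VFM
        return self.head(feats).squeeze(-1)  # (batch, num_patches)

def probe_step(backbone: nn.Module, probe: DepthProbe,
               images: torch.Tensor, depth_gt: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One training step: only the probe is updated; the VFM stays frozen."""
    with torch.no_grad():         # backbone weights are never fine-tuned
        feats = backbone(images)  # assumed to return patch-token features
    pred = probe(feats)
    loss = nn.functional.l1_loss(pred, depth_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the backbone frozen and training only a lightweight head is what makes such probes a measure of the representation itself rather than of fine-tuning capacity, which is presumably why a unified probing framework enables a fair image-vs-video comparison.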