The Diashow Paradox: Stronger 3D-Aware Representations Emerge from Image Sets, Not Videos

Published: 20 Aug 2025 · Last Modified: 16 Oct 2025 · SP4V · CC BY 4.0
Keywords: Vision Foundation Models, 3D, Videos
Abstract: Image-based vision foundation models (VFMs) have demonstrated surprising 3D geometric awareness, despite no explicit 3D supervision or pre-training on multi-view data. While image-based models are widely adopted across a range of downstream tasks, video-based models have so far remained on the sidelines of this success. In this work, we conduct a comparative study of image and video models on three tasks that capture 3D awareness: multi-view consistency, depth estimation, and surface normal estimation. To enable a fair and reproducible evaluation of both image and video models, we develop AnyProbe, a unified framework for probing network representations. The results of our study reveal a surprising conclusion, which we refer to as the diashow paradox: video-based pre-training provides no consistent advantage over image-based pre-training on downstream tasks involving 3D understanding. We formulate two hypotheses to explain our observations, which underscore the need for high-quality video datasets and highlight the inherent complexity of video-based pre-training. AnyProbe will be publicly released to streamline consistent evaluation of image- and video-based VFMs alike.
Submission Number: 9
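To make the probing setup described in the abstract concrete, here is a minimal sketch of how a frozen VFM might be probed for depth estimation. The AnyProbe API is not described on this page, so every name below (the backbone interface, `DepthProbe`, the feature shapes) is an illustrative assumption in generic PyTorch, not the authors' actual code.

```python
import torch
import torch.nn as nn

class DepthProbe(nn.Module):
    """Linear probe mapping frozen patch features to per-patch depth."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)  # one depth value per patch token

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, feat_dim) from a frozen image/video VFM
        return self.head(feats).squeeze(-1)  # (batch, num_patches)

def probe_step(backbone: nn.Module, probe: DepthProbe,
               images: torch.Tensor, depth_gt: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One training step: only the probe is updated; the VFM stays frozen."""
    with torch.no_grad():         # backbone weights are never fine-tuned
        feats = backbone(images)  # assumed to return patch-token features
    pred = probe(feats)
    loss = nn.functional.l1_loss(pred, depth_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Keeping the backbone frozen and training only a lightweight head is what makes such probes a measure of the representation itself rather than of fine-tuning capacity, which is presumably why a unified probing framework enables a fair image-vs-video comparison.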