Can Monocular Single-Depth Foundation Models Generate Multi-Depth Hypotheses?

04 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Depth Foundation Model; Visual Prompting; 3D Spatial Understanding
TL;DR: Depth models trained for single outputs hold latent multi-layer cues, unlockable via Laplacian Visual Prompting.
Abstract: Monocular depth foundation models underpin modern 3D perception, yet they are mainly trained under a restrictive paradigm that predicts a single deterministic depth value per pixel. This formulation assumes that every image ray intersects only one surface, an assumption that fails in transparent or multi-layer scenes. This raises our central question: can models built for single-depth prediction generate multi-depth hypotheses? We find that they can. Beneath their deterministic outputs lies latent multi-layer structure, a hidden geometry obscured by the training objective rather than absent from the model itself. To uncover it, we introduce Laplacian Visual Prompting (LVP), a lightweight input-space perturbation that elicits complementary depth hypotheses from a frozen model without retraining. On our new MD-3k benchmark of ambiguous scenes, LVP consistently decouples foreground and background layers, showing that a single off-the-shelf depth foundation model can be reprogrammed into a multi-hypothesis estimator. These results reveal that the geometric capacity of depth foundation models is richer than their standard outputs suggest, and open a new path toward probing and harnessing hidden representations for more complete 3D understanding under ambiguity.
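The abstract describes Laplacian Visual Prompting (LVP) as a lightweight input-space perturbation that elicits a second depth hypothesis from a frozen model. As a minimal sketch of that idea (the kernel choice, rescaling, and the `depth_model` callable are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

# Standard 3x3 discrete Laplacian kernel (an assumed choice; the paper may
# use a different filter or scale).
LAPLACIAN_KERNEL = np.array([[0,  1, 0],
                             [1, -4, 1],
                             [0,  1, 0]], dtype=np.float32)

def laplacian_prompt(image: np.ndarray) -> np.ndarray:
    """Apply a per-channel Laplacian filter and rescale the result to [0, 1],
    producing a 'visual prompt' with the same shape as the input (H, W, C)."""
    h, w, _ = image.shape
    pad = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += LAPLACIAN_KERNEL[i, j] * pad[i:i + h, j:j + w, :]
    lo, hi = out.min(), out.max()
    return (out - lo) / (hi - lo + 1e-8)

def multi_depth_hypotheses(image: np.ndarray, depth_model):
    """Query a frozen single-depth model twice: once on the raw image,
    once on its Laplacian-prompted version, yielding two depth hypotheses.
    `depth_model` is a hypothetical callable mapping an image to a depth map."""
    d_raw = depth_model(image)                    # standard prediction
    d_lvp = depth_model(laplacian_prompt(image))  # prompted prediction
    return d_raw, d_lvp
```

The key point is that no weights change: the model stays frozen, and only the input is perturbed to surface the complementary depth layer.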
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1835