Can Monocular Single-Depth Foundation Models Generate Multi-Depth Hypotheses?

04 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Depth Foundation Model; Visual Prompting; 3D Spatial Understanding
TL;DR: Depth models trained for single outputs hold latent multi-layer cues, unlockable via Laplacian Visual Prompting.
Abstract: Monocular depth foundation models underpin modern 3D perception, yet they are mainly trained under a restrictive paradigm that predicts a single deterministic depth value per pixel. This formulation assumes that every image ray intersects only one surface, an assumption that fails in transparent or multi-layer scenes. This raises our central question: can models built for single-depth prediction generate multi-depth hypotheses? We find that they can. Beneath their deterministic outputs lies latent multi-layer structure, a hidden geometry obscured by the training objective rather than absent from the model itself. To uncover it, we introduce Laplacian Visual Prompting (LVP), a lightweight input-space perturbation that elicits complementary depth hypotheses from a frozen model without retraining. On our new MD-3k benchmark of ambiguous scenes, LVP consistently decouples foreground and background layers, showing that a single off-the-shelf depth foundation model can be reprogrammed into a multi-hypothesis estimator. These results reveal that the geometric capacity of depth foundation models is richer than their standard outputs suggest, and open a new path toward probing and harnessing hidden representations for more complete 3D understanding under ambiguity.
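The abstract describes Laplacian Visual Prompting (LVP) as a lightweight input-space perturbation that elicits a second depth hypothesis from a frozen model. As a minimal sketch of that idea (the kernel choice, rescaling, and the `depth_model` callable are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

# Standard 3x3 discrete Laplacian kernel (an assumed choice; the paper may
# use a different filter or scale).
LAPLACIAN_KERNEL = np.array([[0,  1, 0],
                             [1, -4, 1],
                             [0,  1, 0]], dtype=np.float32)

def laplacian_prompt(image: np.ndarray) -> np.ndarray:
    """Apply a per-channel Laplacian filter and rescale the result to [0, 1],
    producing a 'visual prompt' with the same shape as the input (H, W, C)."""
    h, w, _ = image.shape
    pad = np.pad(image, ((1, 1), (1, 1), (0, 0)), mode="edge")
    out = np.zeros_like(image, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += LAPLACIAN_KERNEL[i, j] * pad[i:i + h, j:j + w, :]
    lo, hi = out.min(), out.max()
    return (out - lo) / (hi - lo + 1e-8)

def multi_depth_hypotheses(image: np.ndarray, depth_model):
    """Query a frozen single-depth model twice: once on the raw image,
    once on its Laplacian-prompted version, yielding two depth hypotheses.
    `depth_model` is a hypothetical callable mapping an image to a depth map."""
    d_raw = depth_model(image)                    # standard prediction
    d_lvp = depth_model(laplacian_prompt(image))  # prompted prediction
    return d_raw, d_lvp
```

The key point is that no weights change: the model stays frozen, and only the input is perturbed to surface the complementary depth layer.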
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1835