Abstract: Recent monocular depth estimation systems increasingly benefit from strong pretrained visual encoders and large-scale training data. However, once high-quality dense features are available, two remaining sources of cost become especially important: the complexity of Dense Prediction Transformer (DPT)-style multi-branch decoders and the quality of the depth samples used for training. We present AnyDepth, an efficient framework for zero-shot monocular depth estimation that studies these two design choices under controlled settings. AnyDepth uses a frozen DINOv3 encoder and replaces the conventional reassemble-then-fuse DPT head with a Simple Depth Transformer (SDT), a single-path decoder that fuses projected multi-layer tokens before spatial reconstruction. SDT further combines lightweight local refinement with learnable progressive upsampling to improve detail preservation without introducing multi-branch feature alignment. In parallel, we introduce two depth-specific sample quality scores, based on depth distribution and gradient continuity, to filter low-quality training samples before optimization. Across standard indoor, outdoor, synthetic, and robot-scene benchmarks, SDT improves the efficiency-accuracy trade-off relative to DPT under matched encoder and training settings, reducing decoder parameters by 86.6% to 89.2% while lowering computational cost and edge-device latency. The filtering strategy reduces the merged training set from 584K to 369K samples and preserves or improves several metrics under controlled comparisons. These results suggest that, in the era of strong frozen visual encoders, decoder simplicity and data quality remain practical control points for reproducible and deployable zero-shot depth estimation.
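The abstract names two sample quality scores (depth distribution and gradient continuity) without defining them. The sketch below is only an illustration of how such a filter might look, not the scoring used in AnyDepth: the function names `distribution_score`, `gradient_continuity_score`, and `keep_sample`, the entropy-based and relative-jump definitions, and all thresholds are assumptions introduced here.

```python
import math


def distribution_score(depth, bins=16):
    """Hypothetical score in [0, 1]: normalized entropy of the depth histogram.

    A map whose valid depths all fall at one value scores 0.
    """
    values = [d for row in depth for d in row if d > 0 and math.isfinite(d)]
    if not values:
        return 0.0
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    n = len(values)
    entropy = -sum((c / n) * math.log(c / n) for c in counts if c)
    return entropy / math.log(bins)


def gradient_continuity_score(depth, rel_jump=0.1):
    """Hypothetical score in [0, 1]: fraction of adjacent valid pixel pairs
    whose relative depth difference stays below rel_jump (assumed threshold)."""
    height, width = len(depth), len(depth[0])
    smooth = total = 0
    for y in range(height):
        for x in range(width):
            for dy, dx in ((0, 1), (1, 0)):  # right and lower neighbours
                yy, xx = y + dy, x + dx
                if yy < height and xx < width:
                    a, b = depth[y][x], depth[yy][xx]
                    if a > 0 and b > 0:
                        total += 1
                        if abs(a - b) / max(a, b) < rel_jump:
                            smooth += 1
    return smooth / total if total else 0.0


def keep_sample(depth, min_dist=0.3, min_grad=0.8):
    """Keep a sample only if both assumed scores clear their assumed thresholds."""
    return (distribution_score(depth) >= min_dist
            and gradient_continuity_score(depth) >= min_grad)


# Toy depth maps: a smooth ramp, a constant plane, and a noisy checkerboard.
ramp = [[1.0 + 0.05 * x for x in range(32)] for _ in range(32)]
flat = [[2.0] * 32 for _ in range(32)]
checker = [[1.0 if (x + y) % 2 == 0 else 10.0 for x in range(32)] for y in range(32)]
```

Under these assumed definitions, the smooth ramp passes the filter, while the constant plane (degenerate depth distribution) and the checkerboard (discontinuous everywhere) are rejected.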
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Salman_Asif1
Submission Number: 9308