Keywords: depth estimation, segment hierarchy, vision transformer
Abstract: Monocular depth estimation is a dense prediction task that infers per-pixel depth from a single image, and is fundamental to 3D perception and robotics. Strong depth foundation models now exist, supported by backbones pre-trained on massive amounts of data. However, do these depth foundation models really understand scene structure? Although real-world scenes exhibit strong structure, these methods treat depth estimation as an independent per-pixel regression problem, often producing structural inconsistencies in depth maps, such as ambiguous object shapes. We propose SHED, a novel encoder-decoder architecture that enforces an explicit geometric prior derived from spatial layout by incorporating segmentation into depth estimation. Inspired by bidirectional hierarchical reasoning in human perception, SHED redesigns the vision transformer by replacing fixed patch tokens with segment tokens, which are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, and the intermediate segment hierarchy emerges naturally without explicit supervision. SHED offers three key advantages. First, it improves depth boundaries and segment coherence, and demonstrates robust cross-domain generalization. Second, it enables features and segments to better capture global scene layout. Third, it enhances 3D reconstruction and reveals part structures that conventional pixel-wise methods fail to capture.
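As a rough illustration of the pool/unpool idea described in the abstract, here is a minimal PyTorch sketch. It is not the authors' implementation: all module names (SegmentPool, SegmentUnpool, TinySHED), segment counts, and dimensions are hypothetical. It only shows how fine tokens could be soft-pooled into a smaller set of segment tokens in the encoder, and broadcast back in the decoder by reusing the saved assignments, so the hierarchy is reversed and only the final per-token depth output is supervised.

```python
# Hypothetical sketch of hierarchical segment pooling/unpooling.
# Names, shapes, and segment counts are illustrative assumptions,
# not the SHED paper's actual implementation.
import torch
import torch.nn as nn

class SegmentPool(nn.Module):
    """Soft-pools N fine tokens into M coarse segment tokens."""
    def __init__(self, dim, num_segments):
        super().__init__()
        self.to_assign = nn.Linear(dim, num_segments)

    def forward(self, x):                            # x: (B, N, D)
        assign = self.to_assign(x).softmax(dim=-1)   # (B, N, M)
        pooled = assign.transpose(1, 2) @ x          # (B, M, D)
        # Weighted mean per segment (normalize by total assignment mass).
        pooled = pooled / (assign.sum(dim=1).unsqueeze(-1) + 1e-6)
        return pooled, assign

class SegmentUnpool(nn.Module):
    """Broadcasts coarse tokens back to fine tokens via saved assignments."""
    def forward(self, coarse, assign, skip):         # coarse: (B, M, D)
        fine = assign @ coarse                       # (B, N, D)
        return fine + skip                           # fuse with encoder skip

class TinySHED(nn.Module):
    def __init__(self, dim=64, patch=16, segments=(64, 16)):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, patch, stride=patch)
        self.pools = nn.ModuleList(SegmentPool(dim, m) for m in segments)
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, 4, dim * 4, batch_first=True)
            for _ in range(len(segments) + 1))
        self.unpool = SegmentUnpool()
        self.head = nn.Linear(dim, 1)                # per-token depth

    def forward(self, img):                          # img: (B, 3, H, W)
        x = self.embed(img).flatten(2).transpose(1, 2)  # (B, N, D)
        skips, assigns = [], []
        for pool, block in zip(self.pools, self.blocks):
            x = block(x)
            skips.append(x)
            x, a = pool(x)                           # coarsen the token set
            assigns.append(a)
        x = self.blocks[-1](x)                       # coarsest-level reasoning
        for a, s in zip(reversed(assigns), reversed(skips)):
            x = self.unpool(x, a, s)                 # reverse the hierarchy
        return self.head(x)                          # (B, N, 1)

# Usage: a 224x224 image yields 196 patch tokens, pooled to 64 then 16
# segment tokens, then unpooled back to 196 depth predictions.
depth = TinySHED()(torch.randn(1, 3, 224, 224))
```

Note that in this sketch the segment hierarchy is never supervised directly: only the final depth head carries a loss, so the assignment matrices are free to organize into whatever grouping best serves depth prediction, consistent with the emergent hierarchy the abstract describes.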
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Submission Number: 19778