SHED Light on Segmentation for Depth Estimation

Published: 20 Aug 2025, Last Modified: 20 Aug 2025, SP4V, CC BY 4.0
Keywords: dense prediction, depth estimation, segment hierarchy, vision transformer
Abstract: Monocular depth estimation is a dense prediction task that infers per-pixel depth from a single image, fundamental to 3D perception and robotics. Although real-world scenes exhibit strong structure, most methods treat depth estimation as independent pixel-wise regression, often resulting in structural inconsistencies in depth maps, such as ambiguous object shapes. We propose SHED, a novel encoder-decoder architecture that incorporates segmentation into dense prediction. Inspired by the bidirectional hierarchical reasoning in human perception, SHED improves upon DPT by replacing fixed patch tokens with segment tokens, which are hierarchically pooled in the encoder and unpooled in the decoder to reverse the hierarchy. The model is supervised only at the final output, and the intermediate segment hierarchy emerges naturally without explicit supervision. SHED offers three key advantages over DPT. First, it improves depth boundaries and segment coherence while reducing computational cost. Second, it enables features and segments to better capture global scene layout. Third, it enhances 3D reconstruction and reveals part structures that conventional pixel-wise methods fail to capture.
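The abstract describes an encoder that pools fine tokens into coarser segment tokens and a decoder that unpools them back, with supervision only at the final depth output. Below is a minimal sketch of that idea, not the authors' implementation: the soft-assignment pooling, the module names (`TinySHED`, `SegmentPool`, `SegmentUnpool`), and all sizes are assumptions for illustration; the paper builds on DPT with a multi-level hierarchy rather than the single pooling level shown here.

```python
# Hedged sketch of segment-token pooling/unpooling around a transformer bottleneck.
# Everything here is an illustrative assumption, not the SHED reference code.
import torch
import torch.nn as nn


class SegmentPool(nn.Module):
    """Pool N fine tokens into M coarser segment tokens via a learned soft assignment."""

    def __init__(self, dim: int, num_segments: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_segments, dim) * dim**-0.5)
        self.to_key = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor):
        # x: (B, N, dim) fine tokens
        keys = self.to_key(x)                                   # (B, N, dim)
        logits = torch.einsum("md,bnd->bmn", self.queries, keys)
        assign = logits.softmax(dim=1)                          # each fine token spreads over M segments
        segments = torch.einsum("bmn,bnd->bmd", assign, x)      # (B, M, dim) segment tokens
        segments = segments / (assign.sum(dim=-1, keepdim=True) + 1e-6)
        return segments, assign


class SegmentUnpool(nn.Module):
    """Broadcast segment tokens back to the fine resolution using the stored assignment."""

    def forward(self, segments: torch.Tensor, assign: torch.Tensor, skip: torch.Tensor):
        # segments: (B, M, dim), assign: (B, M, N), skip: (B, N, dim) encoder features
        fine = torch.einsum("bmn,bmd->bnd", assign, segments)   # (B, N, dim)
        return fine + skip                                      # fuse with the encoder skip connection


class TinySHED(nn.Module):
    """Toy encoder-decoder with one pooling level and a per-token depth head."""

    def __init__(self, dim: int = 64, num_segments: int = 16):
        super().__init__()
        self.encoder_fine = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.pool = SegmentPool(dim, num_segments)
        self.encoder_coarse = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.unpool = SegmentUnpool()
        self.decoder_fine = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.depth_head = nn.Linear(dim, 1)  # supervised only at this final output

    def forward(self, tokens: torch.Tensor):
        fine = self.encoder_fine(tokens)                # (B, N, dim)
        segments, assign = self.pool(fine)              # (B, M, dim), (B, M, N)
        segments = self.encoder_coarse(segments)        # reasoning over segment tokens
        fused = self.unpool(segments, assign, fine)     # back to (B, N, dim)
        fused = self.decoder_fine(fused)
        return self.depth_head(fused).squeeze(-1)       # per-token depth (B, N)


if __name__ == "__main__":
    x = torch.randn(2, 196, 64)                         # e.g. 14x14 patch tokens, dim 64
    print(TinySHED()(x).shape)                          # torch.Size([2, 196])
```

Because the assignment is produced without any segmentation labels and gradients flow only from the depth loss, the grouping in `SegmentPool` is left to emerge from training, which mirrors the abstract's claim that the intermediate segment hierarchy arises without explicit supervision.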
Supplementary Material: pdf
Submission Number: 19