Abstract: Monocular depth estimation is an ill-posed problem, as the same 2D image can be projected from infinitely many 3D scenes.
Although the leading algorithms in this field have reported significant improvements, they are essentially geared to the particular combination of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), which strongly limits their generalizability in real-world scenarios. To cope
with this challenge, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thereby improving generalization. Given camera parameters, the proposed module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground
attention mechanism is designed in the module to optimally combine the ground depth with the residual depth. Our ground embedding is highly flexible and lightweight, leading to a plug-in module that can be readily integrated into various depth estimation networks. Experiments reveal that our approach
achieves state-of-the-art results on popular benchmarks,
and more importantly, renders significant generalization
improvement on a wide range of cross-domain tests.
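
To make the geometric idea concrete, the following is a minimal sketch of how a per-pixel ground depth can be derived purely from camera parameters, assuming a flat ground plane and a pinhole camera; the function name, interface, and the pitch-only extrinsics are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ground_plane_depth(H, W, fx, fy, cx, cy, cam_height, pitch=0.0):
    """Per-pixel depth of a flat ground plane in the camera frame.

    Assumes a pinhole camera mounted `cam_height` metres above a horizontal
    ground plane and tilted downward by `pitch` radians. Pixels above the
    horizon have no ground intersection and are set to +inf.
    """
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # Back-project each pixel to a ray with unit z in the camera frame.
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones((H, W))], axis=-1)
    # Rotate rays into a gravity-aligned frame (x right, y down, z forward);
    # positive pitch corresponds to the camera looking down.
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, np.cos(pitch), np.sin(pitch)],
                  [0.0, -np.sin(pitch), np.cos(pitch)]])
    rays_g = rays @ R.T
    # Intersect each ray with the plane y = cam_height (ground below camera).
    y = rays_g[..., 1]
    t = np.where(y > 1e-6, cam_height / np.maximum(y, 1e-6), np.inf)
    # Rays have unit z in the camera frame, so the scale t is the z-depth.
    return t
```

In the full module, this geometric prior would be combined with a network-predicted residual depth through the ground attention; that blending is learned, so it is not sketched here.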