Fixing Defect of Photometric Loss for Self-Supervised Monocular Depth Estimation
Abstract: View-synthesis-based methods have shown promising results for unsupervised depth estimation from single images. Most existing approaches synthesize a new image and employ it as the supervision signal for depth and pose prediction. These approaches suffer from two problems: 1) many combinations of pose and depth can synthesize the same new image, so recovering depth and pose from only two images via view synthesis is inherently ill-posed; 2) the model is trained under the photometric consistency assumption that brightness (or image gradient) remains constant across video frames, an assumption that is easily violated in realistic scenes by lighting changes, reflective surfaces, and occlusions. To overcome the first drawback, we exploit a point cloud consistency constraint to eliminate the ambiguity. To overcome the second drawback, we use threshold masks to filter out dynamic and occluded points, and we introduce matching-point constraints that implicitly encode the geometric relationship between two matched points to improve the precision of depth prediction. In addition, we employ epipolar constraints to compensate for the instability of the photometric error in textureless regions and under varying illumination. Experimental results on the KITTI, Cityscapes, and NYUv2 datasets show that the method improves the accuracy of depth prediction and enhances the model's robustness to textureless regions and illumination changes. The code and data are available at
https://github.com/XTUPRLAB/FixUnDepth.
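For context, below is a minimal PyTorch sketch of the standard SSIM + L1 photometric loss that the abstract critiques, together with a generic algebraic epipolar residual of the kind an epipolar constraint could penalize. The 3x3 SSIM window, the 0.85 weighting, and all function names follow common practice in Monodepth2-style pipelines and are assumptions for illustration, not code from this paper.

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y):
    """Per-pixel DSSIM, (1 - SSIM) / 2, over 3x3 windows.
    A common differentiable structural term in self-supervised depth."""
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    mu_x = F.avg_pool2d(x, 3, 1, 1)
    mu_y = F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + C1) * (2 * sigma_xy + C2)
    den = (mu_x ** 2 + mu_y ** 2 + C1) * (sigma_x + sigma_y + C2)
    return torch.clamp((1 - num / den) / 2, 0, 1)

def photometric_loss(target, warped, alpha=0.85):
    """Standard SSIM + L1 photometric error between the target frame and
    the view synthesized (warped) from a source frame. It assumes
    brightness constancy, which the paper argues breaks under lighting
    changes, reflective surfaces, and occlusions. Returns a per-pixel map."""
    l1 = (target - warped).abs()
    return alpha * ssim_loss(target, warped) + (1 - alpha) * l1

def epipolar_residual(p1, p2, F_mat):
    """Algebraic epipolar error |p2^T F p1| for N matched homogeneous
    points of shape (N, 3). This residual does not depend on pixel
    intensities, so it stays informative in textureless regions and under
    illumination changes; the exact constraint in the paper may differ."""
    return torch.einsum('ni,ij,nj->n', p2, F_mat, p1).abs()

# Hypothetical usage on dummy tensors:
target = torch.rand(1, 3, 192, 640)
warped = torch.rand_like(target)
loss = photometric_loss(target, warped).mean()
```

Because the photometric map above is uninformative wherever brightness constancy fails, an intensity-free geometric term such as the epipolar residual can supply a complementary training signal in exactly those regions.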