Self-supervised Monocular Trained Depth Estimation using Self-attention and Discrete Disparity Volume - ML Reproducibility Challenge 2020

Jan 31, 2021 (edited Apr 01, 2021), ML Reproducibility Challenge 2020 Blind Submission
  • Keywords: unsupervised, monocular depth estimation, self-attention, discrete disparity volume, ordinal regression
  • Abstract: Depth estimation is a widely studied problem in computer vision, as it underpins perception of the 3D world. This capability can be used in applications such as autonomous vehicles and obstacle-warning systems. The paper [1] estimates depth with a self-supervised technique in which the model is trained on a sequence of monocular input images.

    Scope of Reproducibility: State-of-the-art results in depth estimation are mostly obtained with fully-supervised techniques. Although recent unsupervised techniques have shown promising results, the performance gap remains prominent, and they mostly rely on stronger supervision signals (stereo supervision or monocular+stereo supervision). The paper proposes to close this gap with fully-supervised methods by training on the monocular sequence alone, with the help of two additional components: self-attention and a discrete disparity volume.

    Methodology: The architecture proposed in the paper was implemented from scratch, since the original code is not available online. An ablation study and additional hyperparameter experiments were performed, and the reproduced results were compared with those claimed in the paper. All experiments were trained and tested on a GeForce GTX 1080 GPU.

    Results: We reproduced the performance reported in the paper to within 2% of the stated values. It can therefore be inferred that self-attention and the discrete disparity volume help improve performance using only a monocular sequence of images.

    What was easy: The encoder of the proposed algorithm was straightforward to reproduce. The description of the self-attention layer is detailed and could be conveniently reproduced.

    What was difficult: Because the code is not available online, the algorithm had to be reimplemented from scratch. Training with the ResNet101 encoder was time-consuming; however, memory utilization was halved by using activated batch normalization.

    Communication with original authors: The authors helped clarify concepts such as the use of atrous spatial pyramid pooling and the discrete disparity volume, which played a crucial role in reproducing the results reported in the paper.

    References: [1] A. Johnston and G. Carneiro. Self-supervised monocular trained depth estimation using self-attention and discrete disparity volume. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4755–4764, 2020. doi: 10.1109/CVPR42600.2020.00481.
  • Supplementary Material: zip
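The discrete disparity volume mentioned in the abstract predicts, for each pixel, a distribution over a fixed set of candidate disparities and reads out the probability-weighted mean (a soft-argmax). Below is a minimal NumPy sketch of that readout; the function name, tensor shapes, number of bins, and disparity range are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def expected_disparity(logits, d_min=0.01, d_max=10.0):
    """Soft-argmax readout of a discrete disparity volume.

    logits: array of shape (K, H, W) with per-pixel scores for K
    disparity bins (shapes and range are hypothetical). Returns an
    (H, W) map of expected disparity: a softmax over the K bins,
    followed by a probability-weighted sum of the bin centres.
    """
    K = logits.shape[0]
    # K uniformly spaced candidate disparities, broadcastable over (H, W)
    bins = np.linspace(d_min, d_max, K).reshape(K, 1, 1)
    # numerically stable softmax over the bin axis
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)
    return (probs * bins).sum(axis=0)

# Toy usage: uniform logits yield the mean of the bin centres;
# a strongly peaked logit collapses onto that bin's disparity.
uniform = expected_disparity(np.zeros((8, 2, 2)))
peaked_logits = np.zeros((8, 2, 2))
peaked_logits[3] = 50.0  # heavily favour bin 3 at every pixel
peaked = expected_disparity(peaked_logits)
```

One appeal of this formulation, noted in the paper, is that it turns depth regression into a classification-like problem over disparity bins, which can represent multi-modal uncertainty before the weighted average collapses it to a single value.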