How Do Vision Transformers See Depth in Single Images?

Published: 07 Apr 2023, Last Modified: 07 Apr 2023
ICLR 2023 Workshop SR4AD HYBRID
Readers: Everyone
TL;DR: Blog post at https://sr4ad-vit-mde.github.io/blog/2023/visual-cues-monocular-depth-estimation/
Abstract: Neural networks for monocular depth estimation have improved greatly in recent years. Unfortunately, little is understood about the visual cues that these networks use to make a pixel-wise depth estimate from a single image. In this blog post, we take a second look at the publication by van Dijk and de Croon, "How Do Neural Networks See Depth in Single Images?". The original work used carefully constructed synthetic images to gain a better understanding of the visual cues that MonoDepth relies on to estimate depth. We expand on this work by reproducing the original results on MonoDepth and re-running the experiments on the transformer-based model DPT. We use the blog format to allow for better interactivity with the experiments and their data. Our results show that DPT greatly improves over MonoDepth's original performance on a number of experiments.
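For readers who want to reproduce the kind of pixel-wise depth prediction probed in the blog post, a minimal sketch using the publicly released DPT weights from the MiDaS torch.hub entry point could look like the following. The "DPT_Large" checkpoint and the input file name "input.jpg" are assumptions; the exact checkpoint used by the authors may differ.

```python
import cv2
import torch

# Load an image and convert from OpenCV's BGR layout to RGB.
img = cv2.imread("input.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

# Assumed checkpoint: the DPT-Large model published via the MiDaS hub.
midas = torch.hub.load("intel-isl/MiDaS", "DPT_Large")
midas.eval()

# The matching preprocessing transform shipped with the hub entry.
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
input_batch = transforms.dpt_transform(img)

with torch.no_grad():
    prediction = midas(input_batch)
    # Upsample the prediction back to the input resolution,
    # yielding one (inverse) depth value per pixel.
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()
```

The blog post's experiments then perturb the input image (e.g., with synthetic objects or shifted cues) and compare how such per-pixel predictions change between MonoDepth and DPT.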
Track: Research Insight
Type: Blog Post