Transferring Relative Monocular Depth to Surgical Vision with Temporal Consistency

Charlie Budd, Tom Vercauteren

Published: 03 Oct 2024, Last Modified: 05 Mar 2025 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Relative monocular depth, inferring depth up to shift and scale from a single image, is an active research topic. Recent deep learning models, trained on large and varied meta-datasets, now provide excellent performance in the domain of natural images. However, few datasets exist which provide ground truth depth for endoscopic images, making training such models from scratch unfeasible. This work investigates the transfer of these models into the surgical domain, and presents an effective and simple way to improve on standard supervision through the use of temporal consistency self-supervision. We show temporal consistency significantly improves on supervised training alone when transferring to the low-data regime of endoscopy, and outperforms the prevalent self-supervision technique for this task. In addition, we show our method drastically outperforms the state-of-the-art method from within the domain of endoscopy. We also release our code, models, and ensembled meta-dataset, Meta-MED, establishing a strong benchmark for future work.
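
To illustrate what "up to shift and scale" means in practice, the following is a minimal sketch, not the authors' code, of how a relative depth prediction is commonly compared against ground truth: a scale and a shift are first fitted by least squares, and the error is measured only after alignment. The function names `align_scale_shift` and `aligned_abs_error` are hypothetical, and whether the alignment is done in depth or inverse-depth space is an assumption left open here.

```python
def align_scale_shift(pred, target):
    """Return (scale, shift) minimising sum((scale * p + shift - t) ** 2).

    Hypothetical helper: closed-form 1-D least squares over flattened
    depth values, given as equal-length sequences of floats.
    """
    n = len(pred)
    if n != len(target) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    scale = cov / var_p
    shift = mean_t - scale * mean_p
    return scale, shift


def aligned_abs_error(pred, target):
    """Mean absolute error after the best scale-and-shift alignment."""
    scale, shift = align_scale_shift(pred, target)
    return sum(abs(scale * p + shift - t) for p, t in zip(pred, target)) / len(pred)
```

Under this kind of metric, a prediction that differs from the ground truth only by an affine transform scores a perfect error of zero, which is why such models can be trained across datasets whose depth units and offsets disagree.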