MonoRetNet: A Self-supervised Model for Monocular Depth Estimation with Bidirectional Half-Duplex Retention
Abstract: Self-supervised monocular depth estimation, which does not require ground-truth depth, is a cost-effective training strategy. Existing depth estimation networks primarily rely on CNNs, Transformers, or a combination of the two; however, they still suffer from texture bias or substantial computational overhead. To address these problems, we incorporate RetNet into the depth estimation task and design a novel self-supervised framework, MonoRetNet. We decompose the image along two axes and introduce the concepts of consistent depth and continuous depth, which serve as decay factors in the vertical and horizontal directions, respectively. This allows us to retain the retention mechanism's sophisticated unidirectional causal decay while applying it to depth estimation. Experiments on the KITTI dataset demonstrate that, at the smallest model size, MonoRetNet achieves performance comparable or even superior to Lite-Mono, with about a 32% reduction in parameters and a 14% reduction in FLOPs. This work presents a significant step toward applying RetNet to visual tasks, particularly depth estimation.
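To make the retention idea concrete, below is a minimal NumPy sketch of RetNet-style causal retention with an exponential decay mask, together with a hypothetical bidirectional (two-pass) variant along one axis. The function names, the per-axis `gamma` decay factor, and the way the forward and reversed passes are summed are illustrative assumptions for exposition, not the paper's actual MonoRetNet formulation.

```python
import numpy as np

def retention_1d(q, k, v, gamma):
    """Unidirectional retention: out = (Q K^T * D) V, where the decay mask
    D[n, m] = gamma**(n - m) for n >= m and 0 otherwise (RetNet-style)."""
    n = q.shape[0]
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    # Lower-triangular decay mask; future positions (n < m) are zeroed out.
    decay = np.tril(gamma ** np.maximum(diff, 0))
    return (q @ k.T * decay) @ v

def bidirectional_retention(q, k, v, gamma):
    """Hypothetical bidirectional variant: a forward causal pass plus a
    reversed causal pass, sketching how decay could act along one image
    axis (e.g. vertically with one gamma, horizontally with another)."""
    fwd = retention_1d(q, k, v, gamma)
    bwd = retention_1d(q[::-1], k[::-1], v[::-1], gamma)[::-1]
    return fwd + bwd
```

In this sketch, choosing different `gamma` values for the vertical and horizontal passes would play the role the abstract assigns to the "consistent depth" and "continuous depth" decay factors.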