Multi-Modal Fusion of Event and RGB for Monocular Depth Estimation Using a Unified Transformer-based Architecture

Published: 01 Jan 2024 · Last Modified: 15 Apr 2025 · CVPR Workshops 2024 · CC BY-SA 4.0
Abstract: In the field of robotics and autonomous navigation, accurate pixel-level depth estimation has gained significant importance. Event cameras, or dynamic vision sensors, capture asynchronous changes in brightness at the pixel level, offering benefits such as high temporal resolution, no motion blur, and a wide dynamic range. However, unlike traditional cameras that measure absolute intensity, event cameras cannot provide scene context. Efficiently combining the advantages of asynchronous events and synchronous RGB images to enhance depth estimation remains a challenge. In our study, we introduce a unified transformer that combines both event and RGB modalities to achieve precise depth prediction. In contrast to separate transformers for each input modality, a unified transformer model captures inter-modal dependencies and uses self-attention to enhance event-RGB contextual interactions, exceeding the performance of the recurrent neural network (RNN) methods used in state-of-the-art models. To encode temporal information from events, ConvLSTMs are applied before the transformer to further improve depth estimation. Our proposed architecture outperforms existing approaches in absolute mean depth error, achieving state-of-the-art results in most cases. Improvements are also observed in other metrics, including RMSE, absolute relative difference, and depth threshold accuracy. The source code is available at: https://github.com/anusha-devulapally/ER-F2D.
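
The sketch below illustrates, in PyTorch, the general fusion idea described in the abstract: event voxel grids are first passed through a ConvLSTM to encode temporal context, both modalities are then tokenized and concatenated so that a single transformer encoder's self-attention can model cross-modal (event-RGB) interactions, and a dense depth map is regressed from the fused tokens. This is a minimal sketch under assumed design choices; the module names (`ConvLSTMCell`, `UnifiedFusionTransformer`), channel counts, patch size, modality embeddings, and the simple upsampling decoder are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
# Minimal sketch (PyTorch) of event-RGB fusion with a unified transformer.
# All layer sizes and names are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class ConvLSTMCell(nn.Module):
    """Basic ConvLSTM cell used to encode temporal information from event voxel grids."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)


class UnifiedFusionTransformer(nn.Module):
    """Tokenizes event and RGB features jointly so self-attention can capture
    inter-modal dependencies, then regresses a dense depth map."""
    def __init__(self, ev_ch=5, rgb_ch=3, dim=128, patch=8, layers=4, heads=4):
        super().__init__()
        self.dim, self.patch = dim, patch
        self.event_lstm = ConvLSTMCell(ev_ch, dim)
        self.ev_embed = nn.Conv2d(dim, dim, patch, stride=patch)      # event patch tokens
        self.rgb_embed = nn.Conv2d(rgb_ch, dim, patch, stride=patch)  # RGB patch tokens
        self.modality = nn.Parameter(torch.zeros(2, 1, dim))          # learned modality embeddings
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.head = nn.Sequential(                                    # simple upsampling depth head
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, 1, 3, padding=1),
        )

    def forward(self, event_voxels, rgb, state=None):
        # event_voxels: (B, T, ev_ch, H, W) sequence of event voxel grids
        # rgb:          (B, rgb_ch, H, W)   synchronized RGB frame
        B, T, _, H, W = event_voxels.shape
        if state is None:
            h = event_voxels.new_zeros(B, self.dim, H, W)
            state = (h, h.clone())
        for t in range(T):                              # ConvLSTM runs before the transformer
            feat, state = self.event_lstm(event_voxels[:, t], state)
        ev_tok = self.ev_embed(feat).flatten(2).transpose(1, 2) + self.modality[0]
        rgb_tok = self.rgb_embed(rgb).flatten(2).transpose(1, 2) + self.modality[1]
        tokens = torch.cat([ev_tok, rgb_tok], dim=1)    # one token set -> cross-modal self-attention
        fused = self.encoder(tokens)                    # (spatial positional encodings omitted for brevity)
        n = ev_tok.shape[1]                             # decode from the event tokens' spatial grid (an assumption)
        grid = fused[:, :n].transpose(1, 2).reshape(B, self.dim, H // self.patch, W // self.patch)
        return self.head(grid), state


if __name__ == "__main__":
    model = UnifiedFusionTransformer()
    depth, _ = model(torch.randn(2, 4, 5, 64, 64), torch.randn(2, 3, 64, 64))
    print(depth.shape)  # torch.Size([2, 1, 64, 64])
```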