High-Rate Monocular Depth Estimation via Cross Frame-Rate Collaboration of Frames and Events

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · Int. J. Comput. Vis. 2025 · CC BY-SA 4.0
Abstract: Combining the complementary benefits of frames and events has been widely explored for monocular depth estimation in challenging scenarios. However, most existing methods fuse the two modalities synchronously, ignoring the high temporal resolution of event cameras and thus producing low-rate depth maps constrained by the frame sampling rate. To this end, this paper proposes a novel cross frame-rate frame-event joint learning network, namely CFRNet, which collaborates the two heterogeneous streams for high-rate and fine-grained monocular depth estimation. Technically, a cross frame-rate multimodal fusion (CFMF) module is first designed for the joint representation of frames and events. By employing implicit spatial alignment and dynamic attention-based fusion, it addresses the misalignment between frames and events captured at different moments, robustly combining the strengths of both modalities in diverse challenging scenarios. Following the CFMF, a temporal consistent modeling (TCM) module with a recurrent structure is designed to maintain the temporal consistency of the joint representations produced by the CFMF. Experimental results demonstrate that our approach outperforms five existing state-of-the-art methods and three single-modality baselines in depth estimation accuracy on two public datasets (i.e., DSEC and MVSEC), while achieving a frame rate of up to 100 Hz. Code is available at https://github.com/liuxu0303/CFRNet.
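The abstract outlines a two-stage design: a CFMF module that aligns and fuses frame and event features sampled at different rates, followed by a recurrent TCM module that propagates a temporally consistent state across the high-rate event stream. The sketch below illustrates one plausible reading of that pipeline; all module names, layer choices (offset-based warping, a sigmoid attention gate, a ConvGRU cell), and tensor shapes are assumptions for illustration only, not the authors' implementation (see the official repository linked above for that).

```python
# Hypothetical sketch of a CFMF + TCM style pipeline (assumptions, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossFrameRateFusion(nn.Module):
    """Fuses a low-rate frame feature with a high-rate event feature.

    Misalignment between the two moments is handled by predicting a dense
    offset field from the concatenated features (a simple stand-in for the
    paper's implicit spatial alignment); the aligned features are then
    combined with a dynamically predicted per-pixel attention gate.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.offset_head = nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1)
        self.attn_head = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, frame_feat: torch.Tensor, event_feat: torch.Tensor) -> torch.Tensor:
        b, _, h, w = frame_feat.shape
        cat = torch.cat([frame_feat, event_feat], dim=1)

        # Predict a flow-like offset and warp the frame feature toward the
        # event timestamp via grid_sample (implicit alignment stand-in).
        offset = self.offset_head(cat)                        # (B, 2, H, W)
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, h, device=frame_feat.device),
            torch.linspace(-1, 1, w, device=frame_feat.device),
            indexing="ij",
        )
        base_grid = torch.stack([xs, ys], dim=-1).expand(b, -1, -1, -1)
        grid = base_grid + offset.permute(0, 2, 3, 1)
        aligned_frame = F.grid_sample(frame_feat, grid, align_corners=True)

        # Dynamic attention-based fusion: per-pixel gate between modalities.
        gate = torch.sigmoid(self.attn_head(cat))
        return gate * aligned_frame + (1.0 - gate) * event_feat


class ConvGRUCell(nn.Module):
    """Minimal ConvGRU, used here as a stand-in for the recurrent TCM module."""

    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde


if __name__ == "__main__":
    C, H, W = 32, 64, 64
    fusion, tcm = CrossFrameRateFusion(C), ConvGRUCell(C)
    frame_feat = torch.randn(1, C, H, W)   # feature of the most recent frame
    hidden = torch.zeros(1, C, H, W)
    # Several event slices arrive between two frames; each update yields a
    # fused, temporally consistent feature that a depth head could decode.
    for _ in range(4):
        event_feat = torch.randn(1, C, H, W)
        fused = fusion(frame_feat, event_feat)
        hidden = tcm(fused, hidden)
    print(hidden.shape)                    # torch.Size([1, 32, 64, 64])
```

The key point the sketch captures is that the frame feature is reused across multiple event slices, so depth states can be updated at the event rate (up to 100 Hz in the paper) rather than at the frame rate.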