Abstract: Monocular depth estimation with event cameras has attracted widespread research attention because event cameras, with their high dynamic range and high temporal resolution, offer enhanced environmental perception under challenging lighting conditions. However, owing to the inherent texture sparsity of event-camera data, existing event-only methods do not perform well. In this paper, we introduce EDE-Distill, a cross-modal knowledge distillation framework that harnesses the rich feature representations of image-based models to enhance the performance of event-only networks without additional computational cost. In our backbone network, the model not only predicts depth but also generates a corresponding depth uncertainty map, which guides iterative refinement of the depth estimate. Our distillation strategy incorporates an attention mechanism and confidence information derived from the uncertainty map, enabling the event network to autonomously determine distillation weights at both the feature and output levels, thereby improving the quality of the distilled features. Comparative experiments on the MVSEC and DSEC datasets show that EDE-Distill achieves state-of-the-art (SOTA) performance in event-only monocular depth estimation. It also achieves competitive results compared with frame-event fusion methods, with particularly notable improvements over other SOTA algorithms on the MVSEC and DSEC nighttime driving sequences.
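As an illustration of the idea of confidence-weighted distillation at the output level, the following is a minimal sketch, not the formulation used in EDE-Distill, which the abstract does not specify. It assumes a per-pixel uncertainty value that is mapped to a confidence weight with exp(-u), and an absolute-difference penalty between student and teacher depth; the function names `confidence_weights` and `weighted_distillation_loss` and the flat-list representation of a depth map are hypothetical choices made for this example.

```python
import math


def confidence_weights(uncertainty):
    """Map per-pixel uncertainty (>= 0) to a confidence weight in (0, 1].

    Assumption for illustration: weight = exp(-uncertainty).
    """
    return [math.exp(-u) for u in uncertainty]


def weighted_distillation_loss(student_depth, teacher_depth, uncertainty):
    """Confidence-weighted mean absolute difference between two depth maps.

    All three arguments are flat lists of equal length (one value per pixel).
    Pixels with high uncertainty contribute less to the loss.
    """
    if not (len(student_depth) == len(teacher_depth) == len(uncertainty)):
        raise ValueError("all inputs must have the same length")
    if not student_depth:
        raise ValueError("inputs must not be empty")

    weights = confidence_weights(uncertainty)
    weighted_error = sum(
        w * abs(s - t)
        for w, s, t in zip(weights, student_depth, teacher_depth)
    )
    # Normalise by the total weight so the loss stays on the depth scale.
    return weighted_error / sum(weights)


if __name__ == "__main__":
    student = [1.0, 2.0]
    teacher = [1.0, 4.0]
    # Equal confidence everywhere: plain mean absolute error.
    print(weighted_distillation_loss(student, teacher, [0.0, 0.0]))
    # High uncertainty on the mismatched pixel: its error is down-weighted.
    print(weighted_distillation_loss(student, teacher, [0.0, 10.0]))
```

In this sketch, a pixel whose uncertainty is large has almost no influence on the distillation signal, which is one simple way confidence information could steer where the student network follows the teacher.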
External IDs: doi:10.1109/lra.2025.3583486