HAFUNet: A Hierarchical Attention Fusion Network for Monocular Depth Estimation Integrating Event and Frame Data
Abstract: Accurate depth estimation is vital for robotics and autonomous driving, yet it remains challenging in dynamic scenes and under extreme lighting conditions.
Conventional frame-based cameras offer rich scene context but suffer from motion blur and limited dynamic range, while event cameras provide high temporal resolution and a wide dynamic range but lack global scene structure. Recent studies therefore explore frame-event fusion for depth estimation, leveraging these two complementary modalities to achieve robust performance. However, frames provide high spatial resolution but are captured at sparse temporal intervals, whereas event streams are temporally dense yet spatially sparse; this spatio-temporal mismatch makes cross-modal feature fusion ineffective. Moreover, the limited availability of frame-event depth datasets further undermines generalization across different scenes. To address these challenges, we propose HAFUNet, a Hierarchical Attention Fusion
Network for depth estimation via frame-event fusion. Our method comprises: (1) a pre-trained Dual-Stream Encoder (DSEer) that extracts
complementary features from frame and event inputs; (2) a Cross-modal Feature Interaction Module (CFIM) that aligns and fuses
spatial-channel features across modalities; and (3) a Hierarchical Attention Decoder (HADer) that progressively refines depth predictions via attention-guided convolution. Experiments on synthetic and real-world datasets show that HAFUNet surpasses existing methods in depth accuracy and robustness. These results demonstrate the strength of our fusion strategy in diverse environments.
Code is available at https://github.com/SiYZhangwh/HAFUNet.
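As a reading aid, below is a minimal PyTorch sketch of the three-stage pipeline summarized in the abstract: dual-stream encoding, cross-modal feature interaction, and coarse-to-fine decoding. The module names (other than reusing the DSEer/CFIM/HADer roles), the 5-bin event voxel-grid input, the channel widths, and the simplified channel gating used in place of the paper's spatial-channel attention and attention-guided convolutions are all assumptions for illustration, not the authors' implementation; see the linked repository for the actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBlock(nn.Module):
    """Two 3x3 convolutions with stride-2 downsampling, used by both encoder streams."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)


class CFIMSketch(nn.Module):
    """Simplified cross-modal interaction: each modality's channel statistics
    gate the other modality, then the gated features are fused by a 1x1 conv."""
    def __init__(self, ch):
        super().__init__()
        self.frame_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.event_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_frame, f_event):
        f = f_frame * self.event_gate(f_event)  # event cues re-weight frame channels
        e = f_event * self.frame_gate(f_frame)  # frame cues re-weight event channels
        return self.fuse(torch.cat([f, e], dim=1))


class HAFUNetSketch(nn.Module):
    """Frame + event voxel grid -> dense depth; decoder attention is simplified to plain convs."""
    def __init__(self, event_bins=5, chs=(32, 64, 128)):
        super().__init__()
        self.frame_enc, self.event_enc, self.fusion = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
        in_f, in_e = 3, event_bins
        for ch in chs:
            self.frame_enc.append(ConvBlock(in_f, ch))
            self.event_enc.append(ConvBlock(in_e, ch))
            self.fusion.append(CFIMSketch(ch))
            in_f = in_e = ch
        # Coarse-to-fine decoder: each stage refines the upsampled coarser estimate
        # together with the fused skip features at the matching resolution.
        self.decoder = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chs[i] + (chs[i + 1] if i + 1 < len(chs) else 0), chs[i], 3, padding=1),
                nn.ReLU(inplace=True),
            )
            for i in reversed(range(len(chs)))
        )
        self.head = nn.Conv2d(chs[0], 1, 3, padding=1)

    def forward(self, frame, events):
        skips, f, e = [], frame, events
        for f_enc, e_enc, fuse in zip(self.frame_enc, self.event_enc, self.fusion):
            f, e = f_enc(f), e_enc(e)
            skips.append(fuse(f, e))
        x = None
        for dec, skip in zip(self.decoder, reversed(skips)):
            if x is not None:
                x = torch.cat([skip, F.interpolate(x, size=skip.shape[-2:])], dim=1)
            else:
                x = skip
            x = dec(x)
        return self.head(F.interpolate(x, scale_factor=2.0))


if __name__ == "__main__":
    model = HAFUNetSketch()
    frame = torch.rand(1, 3, 224, 224)   # RGB frame
    events = torch.rand(1, 5, 224, 224)  # event stream represented as a 5-bin voxel grid
    print(model(frame, events).shape)    # -> torch.Size([1, 1, 224, 224])
```

The sketch only illustrates the data flow implied by the abstract: both modalities are encoded in parallel, fused per scale by a cross-modal interaction block, and the decoder progressively upsamples while consuming the fused skip features to produce a full-resolution depth map.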