SCNet: Spatio-temporal Feature Aggregation and Cross-modal Interactive Encoding Network for DAVIS Object Detection

Published: 25 Jun 2025, Last Modified: 26 Feb 2026 · ICMR Conference
Abstract: DAVIS cameras, which output both event streams and frames simultaneously, are increasingly being used to address the object detection challenges posed by complex lighting and motion blur. Nevertheless, fully leveraging the abundant temporal information and effectively fusing data from these two modalities remains a formidable challenge. In this paper, we first design a multi-scale spatio-temporal aggregation (MSTA) module to distill richer semantic information from event frames. Second, we combine the strengths of YOLOv8 and RT-DETR to develop an innovative encoder with Multi-scale Cross-modal dynamic Interactive fusion and multi-level feature interactive Fusion (MCIF). Within MCIF, we propose dynamic channel switching and spatial attention with learnable fusing factors (DCF-CSSA) to improve the complementary interaction of cross-modal features. Extensive experiments demonstrate that our approach, which we call SCNet, significantly outperforms existing state-of-the-art (SOTA) object detection methods that fuse events and frames, achieving an mAP50 improvement of 6.2% on PKU-DAVIS-SOD and 12% on DESC-MOD, both of which contain a large number of samples with challenging lighting conditions and motion blur.
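The abstract does not specify the internals of DCF-CSSA, but the named ingredients (channel switching between modalities, spatial attention, and learnable fusing factors) can be illustrated with a minimal NumPy sketch. Everything below is an assumption for illustration: the function name `dcf_cssa`, the half-and-half channel switch, the sigmoid-of-channel-mean spatial attention, and the scalar fusing factors `alpha`/`beta` are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcf_cssa(frame_feat, event_feat, alpha=0.5, beta=0.5, switch_ratio=0.5):
    """Hypothetical sketch of cross-modal fusion in the spirit of DCF-CSSA.

    frame_feat, event_feat : (C, H, W) feature maps from the two modalities.
    alpha, beta            : "learnable" fusing factors (plain scalars here;
                             in a trained network they would be parameters).
    switch_ratio           : fraction of channels exchanged between modalities.
    """
    C = frame_feat.shape[0]
    k = int(C * switch_ratio)
    # Channel switching: exchange the first k channels between the two streams,
    # so each stream carries a mix of frame and event information.
    f_sw = np.concatenate([event_feat[:k], frame_feat[k:]], axis=0)
    e_sw = np.concatenate([frame_feat[:k], event_feat[k:]], axis=0)
    # Spatial attention: gate each map with a sigmoid of its channel-wise mean.
    f_att = f_sw * sigmoid(f_sw.mean(axis=0, keepdims=True))
    e_att = e_sw * sigmoid(e_sw.mean(axis=0, keepdims=True))
    # Learnable fusing factors weight the two attended streams.
    return alpha * f_att + beta * e_att

# Toy usage: fuse a constant frame map with an all-zero event map.
fused = dcf_sssa_out = dcf_cssa(np.ones((4, 2, 2)), np.zeros((4, 2, 2)))
print(fused.shape)  # (4, 2, 2)
```

The key design point this sketch mirrors is that fusion is not a fixed sum: the channel switch forces each branch to see the other modality, the spatial gate suppresses uninformative regions, and the fusing factors let training decide how much each modality contributes.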