SCNet: Spatio-temporal Feature Aggregation and Cross-modal Interactive Encoding Network for DAVIS Object Detection
Abstract
DAVIS cameras, which output event streams and frames simultaneously, are increasingly used to address object detection challenges posed by complex lighting and motion blur. Nevertheless, fully leveraging the abundant temporal information and effectively fusing the two modalities remain formidable challenges. In this paper, we first design a multi-scale spatio-temporal aggregation (MSTA) module to distill richer semantic information from event frames. Second, we draw on the strengths of YOLOv8 and RT-DETR to develop a novel encoder with multi-scale cross-modal dynamic interactive fusion and multi-level feature interactive fusion (MCIF). Within MCIF, we propose a dynamic channel switching and spatial attention module with learnable fusing factors (DCF-CSSA) to improve the complementary interaction of cross-modal features. Extensive experiments demonstrate that our approach, SCNet, significantly outperforms existing state-of-the-art (SOTA) object detection methods that fuse events and frames, achieving mAP50 improvements of 6.2% on PKU-DAVIS-SOD and 12% on DESC-MOD, both of which contain a large number of samples with challenging lighting conditions and motion blur.
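The DCF-CSSA idea described above (switching low-response channels between modalities, applying spatial attention, then fusing with learnable factors) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the channel gate, the fixed threshold `switch_thresh`, and the scalar factors `alpha`/`beta` (which the paper learns during training) are all simplifying assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dcf_cssa(f_rgb, f_evt, alpha=0.5, beta=0.5, switch_thresh=0.5):
    """Hypothetical sketch of dynamic channel switching + spatial attention
    fusion. Inputs are feature maps of shape (C, H, W) from the frame and
    event branches; alpha/beta stand in for the learnable fusing factors."""
    # Channel gate: global average pooling per channel -> sigmoid score.
    gate = sigmoid(f_rgb.mean(axis=(1, 2)))              # shape (C,)
    swap = gate < switch_thresh                          # low-response channels
    # Swap the gated channels between the two modalities.
    f_rgb_sw = np.where(swap[:, None, None], f_evt, f_rgb)
    f_evt_sw = np.where(swap[:, None, None], f_rgb, f_evt)
    # Spatial attention: channel-wise mean of each switched stream.
    att_rgb = sigmoid(f_rgb_sw.mean(axis=0, keepdims=True))  # (1, H, W)
    att_evt = sigmoid(f_evt_sw.mean(axis=0, keepdims=True))
    # Weighted fusion with the (here fixed) fusing factors.
    return alpha * att_rgb * f_rgb_sw + beta * att_evt * f_evt_sw
```

In the actual network the gating and fusing factors would be trained end-to-end (e.g. as PyTorch parameters); the sketch only shows the data flow of channel exchange, spatial attention, and weighted fusion.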