Multi-Event Representation and Multi-Level Fusion for Robust RGB-Event Object Tracking
Abstract: Integrating RGB frames with event data enables cross-modal object tracking that combines motion cues from event cameras with texture cues from RGB cameras. However, existing RGB-event tracking methods often extract only limited information from events and, because they fuse features at a single level, achieve insufficient cross-modal interaction, which creates performance bottlenecks. To address these challenges, we propose a novel three-branch architecture that leverages multi-event representations and multi-level fusion to achieve robust RGB-event tracking. Specifically, we first combine two complementary event representations, event frames and time surfaces, to comprehensively capture the spatio-temporal context of moving objects. Second, we design an attention-based Interactive Enhancement and Adaptive (IEA) module that efficiently and adaptively facilitates information interaction and feature integration across modalities. Third, we introduce a Hybrid Feature Transformation and Fusion Strategy (HFTFS) that simultaneously extracts and integrates composite, differential, and conjoint features during fusion, thereby enabling robust cross-modal information integration. Finally, progressively performing intra- and inter-modal feature fusion yields more informative feature representations. Extensive experiments on several challenging datasets demonstrate that our method achieves state-of-the-art performance, and ablation studies further validate the contribution of each component. The code will be made publicly available.
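The abstract names event frames and time surfaces as the two complementary event representations but does not give their exact formulation. As a rough illustration only, the sketch below uses one common definition of each: an event frame that sums signed polarities per pixel, and a time surface that applies an exponential decay to the most recent event timestamp per pixel. The event layout `(x, y, t, p)`, the decay constant `tau`, and the function names are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def event_frame(events, height, width):
    """Sum signed polarities per pixel (assumed event layout: x, y, t, p)."""
    frame = np.zeros((height, width), dtype=np.float32)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    p = events[:, 3]
    # np.add.at accumulates correctly even when a pixel index repeats.
    np.add.at(frame, (y, x), p)
    return frame

def time_surface(events, height, width, t_ref, tau=0.05):
    """Exponentially decayed time of the latest event at each pixel.

    tau is a hypothetical decay constant in the same unit as the timestamps.
    Pixels that never fired keep a timestamp of -inf and decay to 0.
    """
    last_t = np.full((height, width), -np.inf, dtype=np.float64)
    x = events[:, 0].astype(np.int64)
    y = events[:, 1].astype(np.int64)
    t = events[:, 2]
    # Keep the most recent timestamp seen at each pixel.
    np.maximum.at(last_t, (y, x), t)
    return np.exp((last_t - t_ref) / tau).astype(np.float32)

# Toy stream: columns are x, y, timestamp (seconds), polarity (+1 or -1).
events = np.array([
    [0, 0, 0.00, 1.0],
    [1, 0, 0.02, -1.0],
    [0, 0, 0.04, 1.0],
])
frame = event_frame(events, height=2, width=2)
surface = time_surface(events, height=2, width=2, t_ref=0.04)
```

In this toy stream the event frame records how much and in which direction brightness changed at each pixel, while the time surface records how recently each pixel was active, which is the sense in which the two representations are complementary.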