Cross-Level Fusion: Integrating Object Lists with Raw Sensor Data for 3D Object Tracking

Xiangzhong Liu, Xihao Wang, Hao Shen

Published: 2025, Last Modified: 05 Mar 2026, IROS 2025, CC BY-SA 4.0
Abstract: Smart sensors and Vehicle-To-Everything (V2X) modules are commonly used in automotive perception systems; these components primarily provide processed object lists rather than raw data. However, high-level fusion approaches suffer from significant information loss and representational misalignment because such high-level outputs are inherently abstract and sparse. We propose a novel cross-level fusion paradigm that enables bidirectional information flow between object lists and raw vision features within an end-to-end Transformer framework for 3D object detection and tracking. Our approach extracts inherent positional and dimensional cues from object lists to generate two outputs: structured query features that are fused with the initial learnable queries in the Transformer decoder, and soft Gaussian attention masks that guide feature extraction. This integrated mechanism not only improves tracking accuracy by synergistically combining object priors with fine-grained vision data, but also promotes hardware economy and AI model sustainability by adapting legacy sensors to evolving sensor setups. To overcome the lack of dedicated datasets, we develop a pseudo object list generation pipeline that simulates realistic sensor tracking behavior. Experiments on the nuScenes dataset demonstrate significant performance gains over vision-only baselines and robust generalization across diverse noise levels, validating the efficacy of our cross-level fusion strategy. The code is available at: https://github.com/CesarLiu/DNF.git.
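To make the fusion mechanism described in the abstract more concrete, the following PyTorch sketch illustrates one plausible reading of it: object-list entries (center position and size) are encoded into query features that are added to learnable decoder queries, and the same entries parameterize soft Gaussian attention masks over a bird's-eye-view feature grid. The module name, layer sizes, additive fusion, and the size-scaled Gaussian heuristic are all assumptions for illustration, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn


class CrossLevelQueryFusion(nn.Module):
    """Illustrative sketch (hypothetical): derive query features and soft
    Gaussian attention masks from object-list entries, then fuse the query
    features with learnable decoder queries."""

    def __init__(self, num_queries=300, embed_dim=256, bev_size=(200, 200)):
        super().__init__()
        self.bev_h, self.bev_w = bev_size
        self.learnable_queries = nn.Embedding(num_queries, embed_dim)
        # Encodes (cx, cy, w, l) of each object-list entry into a query feature.
        self.obj_encoder = nn.Sequential(
            nn.Linear(4, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def gaussian_masks(self, boxes):
        """Soft Gaussian masks centered at each object, scaled by its size.
        boxes: (N, 4) tensor of normalized (cx, cy, w, l) in [0, 1]."""
        ys = torch.linspace(0, 1, self.bev_h, device=boxes.device)
        xs = torch.linspace(0, 1, self.bev_w, device=boxes.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        cx, cy, w, l = boxes.unbind(-1)
        # Broadcast to (N, H, W); sigma tied to box size is an assumed heuristic.
        dx = (grid_x[None] - cx[:, None, None]) / (w[:, None, None] + 1e-6)
        dy = (grid_y[None] - cy[:, None, None]) / (l[:, None, None] + 1e-6)
        return torch.exp(-0.5 * (dx ** 2 + dy ** 2))

    def forward(self, object_list):
        """object_list: (N, 4) normalized (cx, cy, w, l) from a smart sensor."""
        obj_queries = self.obj_encoder(object_list)        # (N, D)
        queries = self.learnable_queries.weight.clone()    # (Q, D)
        n = min(obj_queries.shape[0], queries.shape[0])
        queries[:n] = queries[:n] + obj_queries[:n]        # additive fusion (assumed)
        masks = self.gaussian_masks(object_list)           # (N, H, W)
        return queries, masks


# Usage with a toy object list of two detections (normalized coordinates):
fusion = CrossLevelQueryFusion()
obj_list = torch.tensor([[0.3, 0.5, 0.05, 0.10], [0.7, 0.2, 0.04, 0.08]])
queries, masks = fusion(obj_list)  # queries: (300, 256), masks: (2, 200, 200)
```

The masks can be applied as multiplicative priors on attention weights or feature maps, which is one way object-level cues could "guide feature extraction" as described; the exact integration point in the Transformer is not specified here.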