3D Single Object Tracking With Cross-Modal Fusion Conflict Elimination

Published: 01 Jan 2025, Last Modified: 06 Nov 2025 · IEEE Robotics and Automation Letters 2025 · CC BY-SA 4.0
Abstract: 3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome these limitations, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). Point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera provides RGB images rich in semantic and texture information; CCETrack leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. A sparse feature enhancement module is then designed to refine voxelized point cloud features with the rich image features. In the feature fusion stage, both modalities are converted into BEV features, and the template and search-area features are fused separately; a self-attention mechanism establishes bidirectional communication between the two regions. By exploiting multimodal complementarity, our method achieves state-of-the-art performance on the KITTI and nuScenes datasets.
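To make the fusion stage concrete, the minimal PyTorch sketch below shows one plausible reading of it: LiDAR and image BEV maps are concatenated per region, projected to a common embedding, and the template and search tokens are joined into a single sequence so that self-attention exchanges information in both directions. This is an illustrative sketch only; all class names, channel widths, and tensor shapes are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the BEV fusion stage described in the abstract.
# Not the authors' code: module names, channel sizes, and shapes are assumed.
import torch
import torch.nn as nn


class BEVCrossRegionFusion(nn.Module):
    """Fuse LiDAR and image BEV features per region, then run self-attention
    over the joint template/search token sequence (one sequence, so the
    communication between the two regions is bidirectional)."""

    def __init__(self, lidar_ch=128, image_ch=128, embed_dim=256, heads=8):
        super().__init__()
        # 1x1 conv merges the concatenated modalities into one embedding
        self.proj = nn.Conv2d(lidar_ch + image_ch, embed_dim, kernel_size=1)
        self.attn = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def fuse_region(self, lidar_bev, image_bev):
        # (B, C, H, W) -> (B, H*W, D): flatten each BEV map into tokens
        x = self.proj(torch.cat([lidar_bev, image_bev], dim=1))
        return x.flatten(2).transpose(1, 2)

    def forward(self, tmpl_lidar, tmpl_image, srch_lidar, srch_image):
        tmpl = self.fuse_region(tmpl_lidar, tmpl_image)  # template tokens
        srch = self.fuse_region(srch_lidar, srch_image)  # search tokens
        joint = torch.cat([tmpl, srch], dim=1)           # one joint sequence
        out, _ = self.attn(joint, joint, joint)          # bidirectional flow
        out = self.norm(out + joint)                     # residual + norm
        # split the sequence back into per-region token sets
        return out[:, :tmpl.size(1)], out[:, tmpl.size(1):]


if __name__ == "__main__":
    fusion = BEVCrossRegionFusion()
    t_l = torch.randn(2, 128, 16, 16)  # template LiDAR BEV (assumed size)
    t_i = torch.randn(2, 128, 16, 16)  # template image BEV
    s_l = torch.randn(2, 128, 32, 32)  # search LiDAR BEV (larger area)
    s_i = torch.randn(2, 128, 32, 32)  # search image BEV
    tmpl_out, srch_out = fusion(t_l, t_i, s_l, s_i)
    print(tmpl_out.shape, srch_out.shape)  # (2, 256, 256), (2, 1024, 256)
```

Concatenating both regions into a single attention sequence is the simplest way to realize the bidirectional communication the abstract mentions; the paper's actual module may structure this exchange differently.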