SparseInteraction: Sparse Semantic Guidance for Radar and Camera 3D Object Detection

Published: 20 Jul 2024, Last Modified: 30 Jul 2024 | MM 2024 Poster | License: CC BY 4.0
Abstract: Multi-modal fusion techniques, such as combining radar and images, enable complementary and cost-effective perception of the surrounding environment regardless of lighting and weather conditions. However, existing fusion methods for surround-view images and radar are challenged by the inherent noise and positional ambiguity of radar, which lead to significant performance losses. To address this limitation, our paper presents a robust, end-to-end fusion framework dubbed SparseInteraction. First, we introduce the Noisy Radar Filter (NRF) module, which extracts foreground radar features by using queried semantic features from the image to filter out noisy radar features. Furthermore, we implement the Sparse Cross-Attention Encoder (SCAE) to effectively blend foreground radar features and image features at a sparse level, addressing the positional ambiguity problem. Finally, to facilitate model convergence and improve performance, foreground prior queries containing the position information of the foreground radar points are concatenated with predefined queries and fed into the subsequent transformer-based decoder. Experimental results demonstrate that the proposed fusion strategies markedly enhance detection performance and achieve new state-of-the-art results on the nuScenes benchmark. Source code is available at https://github.com/GG-Bonds/SparseInteraction.
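To make the described pipeline concrete, below is a minimal PyTorch sketch of the two components named in the abstract. The module names mirror NRF and SCAE, but all shapes, the additive fusion of radar and image semantics, and the top-k foreground selection heuristic are illustrative assumptions rather than the authors' implementation; consult the linked repository for the actual code.

```python
# Hypothetical sketch of NRF + SCAE; shapes and the top-k filtering
# heuristic are assumptions for illustration, not the official code.
import torch
import torch.nn as nn


class NoisyRadarFilter(nn.Module):
    """Scores radar tokens with image-derived semantics and keeps the
    top-scoring (foreground) ones -- one plausible reading of NRF."""

    def __init__(self, dim: int, keep_ratio: float = 0.25):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token foreground score
        self.keep_ratio = keep_ratio

    def forward(self, radar_feats, img_sem):
        # radar_feats: (B, N_r, C); img_sem: (B, N_r, C) image semantic
        # features queried at each radar point's projected location.
        logits = self.score(radar_feats + img_sem).squeeze(-1)  # (B, N_r)
        k = max(1, int(self.keep_ratio * radar_feats.size(1)))
        idx = logits.topk(k, dim=1).indices                     # (B, k)
        gather = idx.unsqueeze(-1).expand(-1, -1, radar_feats.size(-1))
        return radar_feats.gather(1, gather)                    # (B, k, C)


class SparseCrossAttentionEncoder(nn.Module):
    """Foreground radar tokens cross-attend to image tokens (SCAE sketch),
    blending the two modalities at a sparse level."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, fg_radar, img_tokens):
        fused, _ = self.attn(fg_radar, img_tokens, img_tokens)
        return self.norm(fg_radar + fused)


if __name__ == "__main__":
    B, Nr, Ni, C = 2, 128, 256, 64
    nrf = NoisyRadarFilter(C)
    scae = SparseCrossAttentionEncoder(C)
    fg = nrf(torch.randn(B, Nr, C), torch.randn(B, Nr, C))
    fg_queries = scae(fg, torch.randn(B, Ni, C))
    # Per the abstract, these foreground prior queries would then be
    # concatenated with predefined queries and fed to the decoder.
    print(fg_queries.shape)  # torch.Size([2, 32, 64])
```

In this sketch the filtered radar tokens double as the foreground prior queries; the abstract's final step (concatenation with predefined queries before the transformer-based decoder) is indicated in the comments only.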
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: Our research presents SparseInteraction, a fusion framework that synergizes radar and image data to address environmental perception challenges. By overcoming radar noise and positional ambiguity, our method enhances detection under varied lighting and weather conditions, which is crucial for autonomous navigation systems. The core of SparseInteraction lies in two novel components: the Noisy Radar Filter (NRF) and the Sparse Cross-Attention Encoder (SCAE). NRF utilizes image-derived semantic features to refine radar data, while SCAE merges radar and image inputs, correcting positional inaccuracies. This approach tackles key issues in multimodal fusion and achieves new state-of-the-art results on the nuScenes benchmark. Our work demonstrates how combining diverse data types, radar and visual, can lead to significant advances in multimedia research. It aligns with the ACM MM conference's aim to spotlight research that is inherently multimodal and multimedia in nature, marking a substantial contribution to the multimedia and multimodal processing community.
Submission Number: 4729