Keywords: 3D Object Detection, Mamba, Multimodal
Abstract: Multi-modal 3D object detection is critical for autonomous driving. However, prevailing query-based methods suffer from a symmetric fusion bottleneck: they treat geometrically precise LiDAR queries and uncertain camera queries as equally reliable. This overlooks the opportunity to use high-fidelity LiDAR queries to guide the interpretation of noise-prone camera queries. To address this, we propose Selective State Space Modulation for 3D object detection (S3M3D), a novel framework applying two synergistic, Mamba-based components to redefine intra- and inter-modality interactions. First, we introduce Spatially-Aware Mamba (SA-Mamba) to model interactions among LiDAR queries, replacing self-attention. It efficiently captures geometric priors by leveraging recursive Z-order serialization of their projected BEV coordinates. Second, we propose LiDAR-Guided Mamba (LG-Mamba) to establish an asymmetric guidance mechanism, in which the robust LiDAR queries dynamically modulate the state-space processing of the less reliable camera queries. This allows geometric structure to actively steer semantic feature refinement. Extensive experiments on nuScenes demonstrate that S3M3D achieves state-of-the-art performance.
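The Z-order serialization mentioned in the abstract can be illustrated with standard Morton encoding: quantized 2D BEV coordinates have their bits interleaved so that sorting by the resulting code orders queries along a space-filling curve that preserves spatial locality. The sketch below is a minimal illustration of this general technique, not the authors' implementation; the quantization resolution and coordinate range are assumptions.

```python
def _part1by1(n: int) -> int:
    """Spread the lower 16 bits of n so a zero bit sits between each pair."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x: int, y: int) -> int:
    """Interleave the bits of quantized BEV coordinates (x, y) into a Z-order key."""
    return _part1by1(x) | (_part1by1(y) << 1)

def zorder_serialize(bev_coords):
    """Return indices that sort projected BEV query coordinates along the Z-curve.

    bev_coords: list of (x, y) integer cell indices (quantization is assumed
    to have happened upstream, e.g. by binning metric coordinates into a grid).
    """
    return sorted(range(len(bev_coords)),
                  key=lambda i: morton2d(*bev_coords[i]))

# Example: four queries on a 2x2 BEV grid are visited in Z-curve order
# (0,0) -> (1,0) -> (0,1) -> (1,1).
coords = [(1, 1), (0, 0), (0, 1), (1, 0)]
order = zorder_serialize(coords)
```

Ordering queries this way lets a 1D sequence model such as Mamba scan them so that spatially adjacent BEV cells tend to be adjacent in the sequence.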
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 12152