Keywords: 3D Object Detection, Mamba, Multimodal
Abstract: Multi-modal 3D object detection is critical for autonomous driving. However, prevailing query-based methods suffer from a symmetric fusion bottleneck: they treat geometrically precise LiDAR queries and uncertain camera queries as equally reliable. This overlooks the opportunity to use high-fidelity LiDAR queries to guide the interpretation of noise-prone camera queries. To address this, we propose Selective State Space Modulation for 3D object detection (S3M3D), a novel framework applying two synergistic, Mamba-based components to redefine intra- and inter-modality interactions. First, we introduce Spatially-Aware Mamba (SA-Mamba) to model interactions among LiDAR queries, replacing self-attention. It efficiently captures geometric priors by leveraging recursive Z-order serialization of their projected BEV coordinates. Second, we propose LiDAR-Guided Mamba (LG-Mamba) to establish an asymmetric guidance mechanism, in which the robust LiDAR queries dynamically modulate the state-space processing of the less reliable camera queries. This allows geometric structure to actively steer semantic feature refinement. Extensive experiments on nuScenes demonstrate that S3M3D achieves state-of-the-art performance.
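The Z-order serialization mentioned in the abstract can be illustrated with standard Morton encoding: quantized 2D BEV coordinates have their bits interleaved so that sorting by the resulting code orders queries along a space-filling curve that preserves spatial locality. The sketch below is a minimal illustration of this general technique, not the authors' implementation; the quantization resolution and coordinate range are assumptions.

```python
def _part1by1(n: int) -> int:
    """Spread the lower 16 bits of n so a zero bit sits between each pair."""
    n &= 0xFFFF
    n = (n | (n << 8)) & 0x00FF00FF
    n = (n | (n << 4)) & 0x0F0F0F0F
    n = (n | (n << 2)) & 0x33333333
    n = (n | (n << 1)) & 0x55555555
    return n

def morton2d(x: int, y: int) -> int:
    """Interleave the bits of quantized BEV coordinates (x, y) into a Z-order key."""
    return _part1by1(x) | (_part1by1(y) << 1)

def zorder_serialize(bev_coords):
    """Return indices that sort projected BEV query coordinates along the Z-curve.

    bev_coords: list of (x, y) integer cell indices (quantization is assumed
    to have happened upstream, e.g. by binning metric coordinates into a grid).
    """
    return sorted(range(len(bev_coords)),
                  key=lambda i: morton2d(*bev_coords[i]))

# Example: four queries on a 2x2 BEV grid are visited in Z-curve order
# (0,0) -> (1,0) -> (0,1) -> (1,1).
coords = [(1, 1), (0, 0), (0, 1), (1, 0)]
order = zorder_serialize(coords)
```

Ordering queries this way lets a 1D sequence model such as Mamba scan them so that spatially adjacent BEV cells tend to be adjacent in the sequence.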
Supplementary Material: pdf
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 12152