Abstract: Transformer-based architectures have been proven successful in detecting 3D
objects from point clouds. However, the quadratic complexity of the attention
mechanism makes it difficult to encode rich information as point cloud resolution increases.
Recently, state space models (SSMs) such as Mamba have attracted great attention
due to their linear complexity and long-sequence modeling ability for language
understanding. To exploit the potential of Mamba for 3D scene-level perception,
we propose 3DET-Mamba, the first SSM-based model
designed for indoor 3D object detection. Specifically, we divide the point cloud
into different patches and use a lightweight yet effective Inner Mamba to capture
local geometric information. To observe the scene from a global perspective,
we introduce a novel Dual Mamba module that models the point cloud in terms
of spatial distribution and continuity. Additionally, we design a Query-aware
Mamba module that decodes context features into object sets under the guidance of
learnable queries. Extensive experiments demonstrate that 3DET-Mamba surpasses
the previous 3DETR on indoor 3D detection benchmarks such as ScanNet, improving
AP@0.25/AP@0.50 from 65.0%/47.0% to 70.4%/54.4%, respectively.