CasFormer: Cascaded Transformer Based on Dynamic Voxel Pyramid for 3D Object Detection from Point Clouds

Xinglong Li, Xiaowei Zhang

Published: 2023, Last Modified: 13 Nov 2024. PRCV (3) 2023. License: CC BY-SA 4.0

Abstract: Recently, Transformers have been widely applied in 3-D object detection, either to model global contextual relationships in point clouds or to refine proposals. However, the structural information in 3-D point clouds is often incomplete, especially for distant and small objects, which makes accurate detection difficult for these methods. To address this issue, we propose a Cascaded Transformer based on Dynamic Voxel Pyramid (called CasFormer) for 3-D object detection from LiDAR point clouds. Specifically, we dynamically spread relevant features from the voxel pyramid based on the sparsity of each region of interest (RoI), capturing richer semantic information for structurally incomplete objects. Furthermore, a cross-stage attention mechanism is employed to cascade the refined results of the Transformer stage by stage and to improve the training convergence of the Transformer. Extensive experiments demonstrate that CasFormer achieves competitive performance on the KITTI dataset and the Waymo Open Dataset. Compared to CT3D, our method improves car detection by 1.12% and 1.27% at the moderate and hard levels, respectively, on the KITTI online 3-D object detection leaderboard.
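
The abstract does not spell out how RoI sparsity selects features from the voxel pyramid, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes axis-aligned RoIs, measures sparsity as a raw point count, and uses hypothetical thresholds (`32` and `8`) so that sparser RoIs draw features from additional, coarser pyramid levels. The names `count_points_in_roi` and `select_pyramid_levels` are invented for this example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
# Assumption: axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax).
# Real detectors use rotated boxes; this keeps the sketch simple.
Box = Tuple[float, float, float, float, float, float]


def count_points_in_roi(points: Sequence[Point], roi: Box) -> int:
    """Count the LiDAR points that fall inside an axis-aligned RoI."""
    xmin, ymin, zmin, xmax, ymax, zmax = roi
    return sum(
        1
        for x, y, z in points
        if xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
    )


def select_pyramid_levels(
    num_points: int, thresholds: Sequence[int] = (32, 8)
) -> List[int]:
    """Hypothetical rule: level 0 (finest) is always used; each threshold
    the point count falls below adds one coarser pyramid level."""
    levels = [0]
    for level, threshold in enumerate(thresholds, start=1):
        if num_points < threshold:
            levels.append(level)
    return levels


if __name__ == "__main__":
    roi = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    points = [(1.0, 1.0, 1.0), (0.5, 0.5, 0.5), (3.0, 3.0, 3.0)]
    n = count_points_in_roi(points, roi)
    print(n, select_pyramid_levels(n))
```

Under these assumptions, a dense RoI uses only the finest level, while a distant or small object with few points also gathers context from coarser levels.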