Abstract: Current LiDAR-only 3D detection methods are limited by the sparsity of point clouds. Previous methods supplemented the LiDAR point cloud with pseudo points generated by depth completion, but the pseudo point sampling process was complex and the resulting point distribution was uneven. Moreover, because depth completion is imprecise, the pseudo points suffer from noise and local structural ambiguity, which limits further improvement of detection accuracy. This paper presents SQDNet, a novel framework designed to address these challenges. SQDNet incorporates two key components. First, the SQD module achieves sparse-to-dense matching via grid position indices, allowing rapid sampling of large-scale pseudo points directly on the dense depth map and thus streamlining the data preprocessing pipeline; the density of LiDAR points within these grids is used to alleviate the uneven distribution and noise of the pseudo points. Second, the sparse 3D backbone is designed to capture long-range dependencies, improving voxel feature extraction and mitigating local structural blur in the pseudo points. Experimental results validate the effectiveness of SQD and show considerable detection performance on difficult-to-detect instances on the KITTI test set.
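As a rough illustration of the grid-based sampling idea described above, the following is a minimal sketch, not the paper's actual implementation: pseudo points and LiDAR points are hashed to the same grid position indices, and the per-grid sampling budget for pseudo points shrinks where LiDAR points are already dense. The function and parameter names (`sample_pseudo_points`, `grid_size`, `max_per_grid`) and the specific budget rule are assumptions made for illustration.

```python
# Illustrative sketch only: a grid-index-based sampling of pseudo points,
# modulated by per-grid LiDAR density. Names and the budget rule are hypothetical.
import numpy as np

def sample_pseudo_points(pseudo_pts, lidar_pts, grid_size=0.4, max_per_grid=8):
    """Sample pseudo points per BEV grid, keeping fewer where LiDAR is already dense."""
    def grid_index(pts):
        # Quantize x/y coordinates into grid position indices and flatten to a 1D key
        # (assumes grid extents small enough that the flattening does not collide).
        ij = np.floor(pts[:, :2] / grid_size).astype(np.int64)
        return ij[:, 0] * 100000 + ij[:, 1]

    pseudo_idx = grid_index(pseudo_pts)
    lidar_idx = grid_index(lidar_pts)

    # LiDAR point density per grid cell.
    lidar_density = dict(zip(*np.unique(lidar_idx, return_counts=True)))

    kept = []
    for key in np.unique(pseudo_idx):
        in_cell = np.where(pseudo_idx == key)[0]
        # Keep more pseudo points where LiDAR coverage is sparse, fewer where it is dense.
        budget = max(1, max_per_grid - lidar_density.get(key, 0))
        kept.append(np.random.choice(in_cell, size=min(budget, len(in_cell)), replace=False))
    return pseudo_pts[np.concatenate(kept)]
```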
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: We integrate multimodal data (LiDAR and camera) through a data-level fusion process to enhance object detection. Unlike feature-level or decision-level fusion, our strategy aligns LiDAR point clouds and image data in a unified space, using dense image data to supplement the sparse LiDAR point cloud. Through depth completion, we lift 2D image data into the 3D space of the LiDAR point cloud, producing a large number of pseudo points, usually exceeding 300K, far more than the number of LiDAR points. To manage these pseudo points effectively, we developed a fast sampling method. However, the inherent inaccuracy of pseudo points introduces ambiguity in local structures. To address this challenge, we introduce a sparse backbone explicitly designed for robust feature extraction from the enriched point cloud. These innovations demonstrate a sophisticated integration of multimodal data and offer practical solutions to the inherent challenges of multimodal data processing, thereby contributing to the advancement of multimodal processing techniques.
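To make the data-level fusion step concrete, here is a minimal sketch of how a completed dense depth map can be lifted into 3D pseudo points via standard pinhole back-projection. The function name `depth_to_pseudo_points` and the inputs `K` (camera intrinsics) and `T_cam_to_lidar` (camera-to-LiDAR extrinsics) are assumptions for illustration, not necessarily the exact pipeline used in the submission.

```python
# Minimal back-projection sketch: turning a completed dense depth map into 3D
# pseudo points with a pinhole camera model. K and T_cam_to_lidar are assumed
# inputs; this is not claimed to be the paper's exact preprocessing code.
import numpy as np

def depth_to_pseudo_points(depth, K, T_cam_to_lidar):
    """depth: (H, W) completed depth map in meters. Returns (N, 3) points in the LiDAR frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.reshape(-1)
    valid = z > 0
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]

    # Back-project pixels to camera coordinates.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # homogeneous coordinates

    # Transform into the LiDAR frame; a typical image yields hundreds of thousands
    # of pseudo points (>300K in our setting), which is why fast sampling matters.
    return (pts_cam @ T_cam_to_lidar.T)[:, :3]
```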
Submission Number: 3879