Abstract: This paper revisits the development of generative self-supervised learning in 2D images and 3D point clouds in autonomous driving. In 2D images, the pretext task has evolved from low-level to high-level features. Inspired by this, through model analysis we find that the gap in weight distribution between self-supervised learning and supervised learning is substantial when only low-level features are employed as the pretext task in 3D point clouds: low-level features represented by PoInt Cloud reconsTruction are insUfficient to learn 3D REpresentations (dubbed PICTURE). To advance the development of pretext tasks, we propose a unified generative self-supervised framework. First, we demonstrate that high-level features, represented by Seal features, exhibit semantic consistency with downstream tasks. We utilize the Seal voxel features as an additional pretext task to enhance the understanding of semantic information during pre-training. Next, we propose inter-class and intra-class discrimination-guided masking (I$^2$Mask) based on the attributes of the Seal voxel features, which adaptively sets the masking ratio for each superclass. On the Waymo and nuScenes datasets, we achieve 75.13\% mAP and 72.69\% mAPH for 3D object detection, 79.4\% mIoU for 3D semantic segmentation, and 18.4\% mIoU for occupancy prediction. Extensive experiments demonstrate the effectiveness and necessity of high-level features. The project page is available at https://anonymous-picture.github.io/.
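To make the adaptive-masking idea in the abstract concrete, below is a minimal sketch of how a per-superclass masking ratio could be derived from intra-class compactness and inter-class separation of voxel features. The score formulation, function names, and hyperparameters are illustrative assumptions, not the paper's exact I$^2$Mask definition.

```python
# Sketch of adaptive per-superclass masking in the spirit of I^2Mask.
# The discriminability score and the ratio mapping below are assumptions
# for illustration, not the paper's exact formulation.
import torch

def adaptive_mask_ratios(feats, labels, base_ratio=0.6, spread=0.2):
    """Assign one masking ratio per superclass from feature statistics.

    feats:  (N, D) voxel features (e.g., Seal voxel features)
    labels: (N,) superclass id per voxel
    Superclasses that are easier to discriminate (compact clusters,
    far from other centroids) receive higher masking ratios.
    """
    classes = labels.unique()
    centroids = torch.stack([feats[labels == c].mean(0) for c in classes])
    # Intra-class compactness: mean distance of voxels to their centroid.
    intra = torch.stack([
        (feats[labels == c] - centroids[i]).norm(dim=1).mean()
        for i, c in enumerate(classes)
    ])
    # Inter-class separation: distance to the nearest other centroid.
    dists = torch.cdist(centroids, centroids)
    dists.fill_diagonal_(float("inf"))
    inter = dists.min(dim=1).values
    # Discriminability: separated and compact -> easy -> mask more.
    score = inter / (intra + 1e-6)
    score = (score - score.min()) / (score.max() - score.min() + 1e-6)
    # Map normalized scores into [base_ratio - spread, base_ratio + spread].
    return base_ratio + spread * (2.0 * score - 1.0)
```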
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Content] Media Interpretation, [Generation] Multimedia Foundation Models
Relevance To Conference: In autonomous driving scenarios, two common sensors are cameras and LiDAR, corresponding to two modalities of input signals: RGB images and point clouds. This paper investigates point-cloud-based self-supervised learning in autonomous driving scenarios. We find that the RGB semantic concepts contained in powerful vision foundation models can provide high-quality self-supervised signals for point cloud scenes. Pre-training based on multimodal fusion is expected to become an important method for scene understanding in the field of autonomous driving.
Supplementary Material: zip
Submission Number: 795