Abstract: We propose a two-stage pre-training approach using point clouds for a diverse set of 3D understanding tasks. In the first stage, we pre-train the 3D encoder to acquire knowledge from other modalities such as vision and language. This stage aligns 3D representations with multiple modalities by leveraging several pre-trained foundation models, unlike the current cross-modal paradigm that typically uses only a single pre-trained model. In the second stage, we improve upon masked point modeling with global-local feature distillation of semantic 3D embeddings and a token-shuffling strategy. These techniques enable the model to focus on the 3D modality while leveraging the multimodal information associated with the point clouds. This pre-training approach is model-agnostic and can be applied to any 3D transformer encoder. We conduct extensive experiments on a wide range of 3D understanding tasks, from synthetic and real-world object recognition to indoor semantic segmentation and object detection, achieving state-of-the-art results. For instance, on the ScanObjectNN variants, our approach achieves $\textbf{96.1\%}$, $\textbf{94.2\%}$ and $\textbf{91.2\%}$ accuracy using the multi-scale 3D encoder proposed in Point-M2AE.
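To make the two-stage idea concrete, below is a minimal, hypothetical PyTorch sketch of the training losses the abstract describes: a stage-1 alignment of pooled 3D features to several frozen multimodal teachers, and a stage-2 masked point modeling objective with token shuffling and global-local feature distillation. All class, method, and parameter names (e.g. `TwoStagePretrainSketch`, `mask_ratio`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStagePretrainSketch(nn.Module):
    """Illustrative sketch of a two-stage point-cloud pre-training pipeline.

    Stage 1: align pooled 3D token features with features from several
    frozen foundation-model teachers (e.g. vision and language encoders).
    Stage 2: masked point modeling with token shuffling and global-local
    feature distillation of semantic 3D embeddings.
    """

    def __init__(self, dim=384, n_teachers=2):
        super().__init__()
        # Placeholder 3D transformer encoder operating on point-cloud tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # One projection head per frozen teacher modality.
        self.proj_heads = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_teachers))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def stage1_alignment_loss(self, point_tokens, teacher_feats):
        """Align the pooled 3D embedding with each frozen teacher embedding."""
        z = self.encoder(point_tokens).mean(dim=1)            # global 3D feature
        loss = 0.0
        for head, t in zip(self.proj_heads, teacher_feats):   # one teacher per modality
            p = F.normalize(head(z), dim=-1)
            t = F.normalize(t, dim=-1)
            loss = loss + (1 - (p * t).sum(dim=-1)).mean()    # cosine alignment
        return loss / len(teacher_feats)

    def stage2_masked_distill_loss(self, point_tokens, target_feats, mask_ratio=0.6):
        """Shuffle tokens, mask a subset, and distill semantic targets."""
        B, N, D = point_tokens.shape
        perm = torch.randperm(N, device=point_tokens.device)  # token shuffling
        shuffled = point_tokens[:, perm]
        n_mask = int(N * mask_ratio)
        masked = shuffled.clone()
        masked[:, :n_mask] = self.mask_token                  # replace masked tokens
        pred = self.encoder(masked)
        target = target_feats[:, perm]                        # semantic 3D targets
        # Local distillation at masked positions, global distillation on the pooled feature.
        local_loss = F.mse_loss(pred[:, :n_mask], target[:, :n_mask])
        global_loss = F.mse_loss(pred.mean(dim=1), target.mean(dim=1))
        return local_loss + global_loss


if __name__ == "__main__":
    model = TwoStagePretrainSketch()
    pts = torch.randn(2, 128, 384)                       # dummy point-cloud tokens
    teachers = [torch.randn(2, 384), torch.randn(2, 384)]
    print(model.stage1_alignment_loss(pts, teachers).item())
    print(model.stage2_masked_distill_loss(pts, torch.randn(2, 128, 384)).item())
```

The sketch only shows the shape of the two objectives; in practice the stage-2 targets would come from the stage-1 encoder or the frozen teachers, and the encoder could be swapped for any 3D transformer (e.g. the multi-scale Point-M2AE encoder mentioned in the abstract).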
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xuming_He3
Submission Number: 6767