Multimodal Masked Point Distillation for 3D Representation Learning

02 Dec 2025 (modified: 28 Apr 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: We propose a two-stage pre-training approach that uses point clouds for a diverse set of 3D understanding tasks. In the first stage, we pre-train the 3D encoder to acquire knowledge from other modalities such as vision and language. This stage aligns 3D representations with multiple modalities by leveraging several pre-trained foundation models, unlike the current cross-modal paradigm, which typically uses only a single pre-trained model. In the second stage, the pre-training improves upon masked point modeling through global-local feature distillation of semantic 3D embeddings and a token-shuffling strategy. These techniques enable the model to focus on the 3D modality while leveraging the multimodal information associated with the point clouds. The pre-training approach is model-agnostic and can be applied to any 3D transformer encoder. We conduct extensive experiments on a wide range of 3D understanding tasks, from synthetic and real-world object recognition to indoor semantic segmentation and object detection, achieving state-of-the-art results. For instance, on the ScanObjectNN variants, our approach achieves $\textbf{96.1\%}$, $\textbf{94.2\%}$, and $\textbf{91.2\%}$ accuracy using the multi-scale 3D encoder proposed in Point-M2AE.
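A minimal PyTorch-style sketch of the two training objectives described in the abstract, under stated assumptions: the module names, feature dimensions, the InfoNCE-style loss for Stage-1 alignment, the MSE losses for Stage-2 distillation, and the shuffle helper are all illustrative choices, not the authors' implementation.

```python
# Illustrative sketch of the two-stage objectives; all names, dimensions,
# and loss choices are assumptions, not the paper's released code.
import torch
import torch.nn.functional as F
from torch import nn


class Stage1MultiTeacherAlignment(nn.Module):
    """Stage 1: align 3D features with several frozen foundation-model teachers."""

    def __init__(self, dim_3d=384, dim_img=768, dim_txt=512):
        super().__init__()
        # One projection head per teacher modality (hypothetical sizes).
        self.to_img = nn.Linear(dim_3d, dim_img)
        self.to_txt = nn.Linear(dim_3d, dim_txt)

    def forward(self, feat_3d, img_feat, txt_feat, tau=0.07):
        # Contrastive (InfoNCE-style) alignment against each frozen teacher
        # embedding; matched pairs share an index within the batch.
        loss = feat_3d.new_zeros(())
        for z, t in ((self.to_img(feat_3d), img_feat),
                     (self.to_txt(feat_3d), txt_feat)):
            z, t = F.normalize(z, dim=-1), F.normalize(t, dim=-1)
            logits = z @ t.T / tau
            labels = torch.arange(z.size(0), device=z.device)
            loss = loss + F.cross_entropy(logits, labels)
        return loss


def shuffle_tokens(tokens):
    """Permute token order so the decoder cannot exploit positional shortcuts."""
    idx = torch.randperm(tokens.size(1), device=tokens.device)
    return tokens[:, idx], idx


def stage2_global_local_distill(student_tokens, teacher_tokens,
                                student_global, teacher_global):
    """Stage 2: distill semantic 3D embeddings at both the token (local)
    and pooled (global) levels, on top of masked point modeling."""
    local_loss = F.mse_loss(student_tokens, teacher_tokens)
    global_loss = F.mse_loss(student_global, teacher_global)
    return local_loss + global_loss
```

In this reading, Stage 1 yields the semantic 3D teacher whose embeddings are then distilled in Stage 2 alongside the masked-point-modeling objective, which is consistent with the abstract's "global-local feature distillation" and "token shuffling" description.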
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In the revised manuscript, we clarified the positioning and novelty of our approach by adding a dedicated section, “Comparison with Prior Two-Stage and Multimodal Methods” (Section 3), after the Related Work section, where we explicitly contrast our framework with prior methods (e.g., ULIP-2, ACT, I2P-MAE, and ReCon). In the Approach section, we expanded the Stage-1 description with a new paragraph on Multi-Teacher Multimodal Alignment, clarifying its motivation and its role in enabling Stage-2 distillation, with supporting ablation references (Appendix B). In the Experiments section, we incorporated additional results demonstrating generalization across architectures, including PointMamba, and updated the main comparison table (Table 3) accordingly. We also added comparisons with parameter-efficient fine-tuning (PointGST) to provide a stronger evaluation baseline. To improve clarity and maintain conciseness, we reorganized the experimental section by moving extended results on part segmentation and object detection, as well as detailed ablations and implementation details, to the Appendix, while retaining a brief summary in the main text. Finally, we refined and expanded the Discussion and Limitations section to explicitly address training cost, reliance on synthetic pre-training data, scalability to larger datasets and models, multimodal supervision design (including BLIP-2-based captions and fixed prompt templates), applicability to large-scale LiDAR settings, and the limitations of the fixed multi-view rendering strategy in capturing occluded geometry. We also outline corresponding future directions, including incorporating richer language supervision, improving training efficiency, and exploring adaptive or multi-view rendering strategies.
Assigned Action Editor: ~Xuming_He3
Submission Number: 6767