TL;DR: We propose a new paradigm that, for the first time, leverages 2D semantics as prompts to improve the generalization of pretrained 3D models with minimal trainable parameters.
Abstract: A series of pre-trained models have demonstrated promising results in point cloud understanding tasks and are widely applied to downstream tasks through fine-tuning. However, full fine-tuning leads to forgetting of pre-trained knowledge and incurs substantial storage costs on edge devices. To address these issues, Parameter-Efficient Transfer Learning (PETL) methods have been proposed. Our analysis shows that existing 3D PETL methods cannot adequately capture the semantic relationships among features that downstream tasks require, resulting in suboptimal performance. To introduce rich semantic cues while preserving parameter efficiency, we propose a novel fine-tuning paradigm for 3D pre-trained models. We utilize frozen 2D pre-trained models to provide vision semantic prompts and design a new Hybrid Attention Adapter to efficiently fuse 2D semantic cues into 3D representations with minimal trainable parameters (1.8M). Extensive experiments on ScanObjectNN, ModelNet40, and ShapeNetPart demonstrate the effectiveness of the proposed paradigm. In particular, our method achieves 95.6% accuracy on ModelNet40 and 90.09% accuracy on the most challenging ScanObjectNN classification split (PB-T50-RS).
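The abstract's core idea, fusing frozen 2D prompt tokens into 3D token features through a lightweight adapter, can be illustrated with a minimal sketch. This is not the authors' released implementation; the module name, dimensions, and the cross-attention-with-bottleneck design are assumptions chosen to show how a small trainable adapter (a few MB of parameters) can inject 2D semantics into a frozen 3D backbone:

```python
import torch
import torch.nn as nn

class HybridAttentionAdapter(nn.Module):
    """Hypothetical sketch of a 2D->3D fusion adapter (names/dims assumed).

    3D tokens query frozen 2D semantic prompts via cross-attention inside
    a low-rank bottleneck, keeping the trainable parameter count small.
    """
    def __init__(self, dim=384, prompt_dim=768, bottleneck=64, heads=4):
        super().__init__()
        self.proj_2d = nn.Linear(prompt_dim, dim)   # map 2D prompts into 3D token space
        self.down = nn.Linear(dim, bottleneck)      # low-rank down-projection
        self.attn = nn.MultiheadAttention(bottleneck, heads, batch_first=True)
        self.up = nn.Linear(bottleneck, dim)        # project back to model width
        # Zero-init the up-projection so the adapter is an identity map at the
        # start of fine-tuning and cannot disturb the pre-trained features.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, tokens_3d, prompts_2d):
        # tokens_3d: (B, N, dim) from the frozen 3D backbone
        # prompts_2d: (B, M, prompt_dim) from the frozen 2D model
        q = self.down(tokens_3d)
        kv = self.down(self.proj_2d(prompts_2d))
        fused, _ = self.attn(q, kv, kv)             # 3D tokens attend to 2D cues
        return tokens_3d + self.up(fused)           # residual fusion
```

In this sketch only the adapter is trained; both backbones stay frozen, which is what keeps the trainable budget on the order of a couple of million parameters.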
Lay Summary: 3D point clouds are crucial for applications like robotics and autonomous driving, but current methods face challenges in efficiently transferring pre-trained knowledge across tasks without losing important information or requiring large computational resources.
We introduce a novel approach that combines 2D image semantics with 3D point clouds to enhance model performance while minimizing the need for additional resources. By using pre-trained 2D models to generate "semantic prompts," our method helps 3D models generalize better to new tasks with parameter-efficient fine-tuning.
Our experiments show that this approach significantly improves performance in tasks like object classification and part segmentation, achieving state-of-the-art results with fewer resources. This solution not only enhances accuracy but also makes it easier to deploy powerful models on devices with limited computational capabilities.
Primary Area: Applications->Computer Vision
Keywords: point cloud understanding; point cloud analysis; 3D computer vision
Submission Number: 278