Keywords: Point Cloud, 3D Understanding, Multi-Modal LLM
Abstract: In computer-aided design (CAD) and engineering, understanding complex CAD models remains a critical challenge. Existing methods struggle to integrate geometric features because they lack a 3D modality and because modal fusion is difficult. To address this, we introduce PointVLM, a novel multi-modal vision-language model that bridges 3D point cloud processing with vision and natural language understanding to enable precise CAD model interpretation. In addition to the vision and language modalities, PointVLM employs a 3D encoder to extract geometric features from a point cloud of the object. Built on the Qwen2.5-VL architecture, PointVLM fuses the three modality features through a learnable projector module, enabling context-aware interactions between geometric and semantic properties. We further build a pipeline that takes a CAD file and an instruction as input, automatically samples point clouds and renders multi-view images, and outputs a response. Experiments show that PointVLM outperforms existing methods on both generative 3D object classification and 3D object captioning tasks. The source code and pre-trained models will be available at MASKED_URL.
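The abstract does not specify the projector's internals, so the following is a minimal PyTorch sketch of one plausible reading: an MLP projector that maps 3D encoder features into the language model's embedding space, with point, vision, and text tokens concatenated along the sequence dimension before the decoder. The class names, layer choices, and dimensions (e.g., `point_dim`, `llm_dim`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical learnable projector mapping 3D encoder outputs
    into the LLM embedding space. All dimensions are assumptions."""

    def __init__(self, point_dim: int = 384, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(point_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (batch, num_point_tokens, point_dim)
        return self.proj(point_feats)

def fuse_modalities(point_tokens: torch.Tensor,
                    vision_tokens: torch.Tensor,
                    text_tokens: torch.Tensor) -> torch.Tensor:
    # One simple fusion scheme: concatenate the projected point tokens
    # with vision and text token embeddings along the sequence dimension,
    # then feed the joint sequence to the LLM decoder.
    return torch.cat([point_tokens, vision_tokens, text_tokens], dim=1)

# Usage sketch: project point-cloud features, then build the fused sequence.
projector = ModalityProjector()
point_feats = torch.randn(1, 256, 384)    # from a 3D point-cloud encoder
vision_tokens = torch.randn(1, 1024, 3584)  # from the vision encoder
text_tokens = torch.randn(1, 64, 3584)      # embedded instruction tokens
fused = fuse_modalities(projector(point_feats), vision_tokens, text_tokens)
```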
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3365