Keywords: Point Cloud, 3D Understanding, Multi-Modal LLM
Abstract: In computer-aided design (CAD) and engineering, understanding complex CAD models remains a critical challenge. Existing methods struggle to integrate geometric features because they lack a 3D modality and because modal fusion is difficult. To address this, we introduce PointVLM, a novel multi-modal vision-language model that bridges 3D point cloud processing with vision and natural language understanding to enable precise CAD model interpretation. In addition to the vision and language modalities, PointVLM employs a 3D encoder to extract geometric features from a point cloud of the object. Built on the Qwen2.5-VL architecture, PointVLM fuses the three modality features through a learnable projector module, enabling context-aware interactions between geometric and semantic properties. We further build a pipeline that takes a CAD file and an instruction as input, automatically samples point clouds and renders multi-view images, and outputs a response. Experiments show that PointVLM outperforms existing methods on both generative 3D object classification and 3D object captioning tasks. The source code and pre-trained models will be available at MASKED_URL.
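The abstract does not specify the projector's internals, so the following is a minimal PyTorch sketch of one plausible reading: an MLP projector that maps 3D encoder features into the language model's embedding space, with point, vision, and text tokens concatenated along the sequence dimension before the decoder. The class names, layer choices, and dimensions (e.g., `point_dim`, `llm_dim`) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Hypothetical learnable projector mapping 3D encoder outputs
    into the LLM embedding space. All dimensions are assumptions."""

    def __init__(self, point_dim: int = 384, llm_dim: int = 3584):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(point_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (batch, num_point_tokens, point_dim)
        return self.proj(point_feats)

def fuse_modalities(point_tokens: torch.Tensor,
                    vision_tokens: torch.Tensor,
                    text_tokens: torch.Tensor) -> torch.Tensor:
    # One simple fusion scheme: concatenate the projected point tokens
    # with vision and text token embeddings along the sequence dimension,
    # then feed the joint sequence to the LLM decoder.
    return torch.cat([point_tokens, vision_tokens, text_tokens], dim=1)

# Usage sketch: project point-cloud features, then build the fused sequence.
projector = ModalityProjector()
point_feats = torch.randn(1, 256, 384)    # from a 3D point-cloud encoder
vision_tokens = torch.randn(1, 1024, 3584)  # from the vision encoder
text_tokens = torch.randn(1, 64, 3584)      # embedded instruction tokens
fused = fuse_modalities(projector(point_feats), vision_tokens, text_tokens)
```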
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 3365