HMFusion: Hierarchical Multi-Modality Fusion for CAD Representation

08 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Multi-Modality Fusion, Image Fusion
Abstract: Computer-Aided Design (CAD) generation, which plays a vital role in product iteration and virtual simulation, has attracted great interest in modern industry. Existing deep learning-based methods for CAD generation have achieved remarkable success, but they typically require either lengthy domain-specific prompts or multiview sketches; although effective, they struggle to maintain consistent geometric representations and depend on multiple inputs. To address these challenges, we propose HMFusion: Hierarchical Multi-Modality Fusion for CAD Representation, which incorporates a cross-modal geometric prior with hierarchical embeddings for consistent and faithful CAD generation. Specifically, our method introduces a prompt-enhancement module that expands minimal user prompts into professional CAD-oriented descriptions containing structural and dimensional details. To improve the consistency of geometric representations, we tightly fuse textual and geometric information through a CAD-aware hierarchical alignment between visual and textual semantics in hyperbolic space. Extensive experiments demonstrate that our proposed framework achieves strong geometric accuracy and semantic fidelity.
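The abstract's key technical ingredient is aligning visual and textual embeddings in hyperbolic space, where distances grow rapidly toward the ball boundary and hierarchies embed naturally. The paper does not give its exact formulation, so the sketch below is only an illustrative assumption: it uses the standard Poincaré ball distance as the alignment measure between a (hypothetical) text embedding and visual embedding, with a simple projection to keep points inside the ball.

```python
import numpy as np

def clip_to_ball(x, max_norm=0.999):
    """Project an embedding into the open unit ball (Poincare ball model)."""
    norm = np.linalg.norm(x)
    return x * (max_norm / norm) if norm >= max_norm else x

def poincare_distance(u, v, eps=1e-7):
    """Geodesic distance between two points in the Poincare ball.

    d(u, v) = arccosh(1 + 2||u - v||^2 / ((1 - ||u||^2)(1 - ||v||^2)))
    """
    sq_diff = np.sum((u - v) ** 2)
    nu = np.sum(u ** 2)
    nv = np.sum(v ** 2)
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - nu) * (1.0 - nv) + eps)
    return np.arccosh(arg)

# Toy text/visual embeddings (hypothetical values, not from the paper).
text_emb = clip_to_ball(np.array([0.10, 0.20, 0.05]))
vis_emb = clip_to_ball(np.array([0.12, 0.18, 0.07]))

# A hierarchical alignment loss could minimize this distance per level.
alignment = poincare_distance(text_emb, vis_emb)
```

In such a setup, embeddings near the origin act as coarse (parent) concepts and embeddings near the boundary as fine-grained details, which is the usual motivation for hyperbolic rather than Euclidean alignment of hierarchical semantics.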
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2900