SeMa3D: Lifting Vision-Language Models for Unsupervised 3D Semantic Correspondence

ICLR 2026 Conference Submission 7677 Authors

16 Sept 2025 (modified: 29 Nov 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: 3D Shape Matching, 3D Correspondences, 3D Vision, Deep Learning, Multi-Modal Language Models
Abstract: We tackle unsupervised dense semantic correspondence for 3D shapes, focusing on severe non-isometric deformations and inter-class matching, a regime where conventional functional map pipelines fail due to ambiguous geometric cues. We propose SeMa3D, a framework that integrates semantic knowledge from vision-language foundation models to build robust vertex-level descriptors. Specifically, SeMa3D aggregates multi-view features from visual foundation models, using a novel colorization strategy that mitigates semantic inconsistencies across renderings, and further enriches them with text embedding fields to capture higher-level information. These descriptors are fused with geometric priors and aligned through a functional map formulation to ensure smooth, globally consistent correspondences. To achieve semantic matching, we introduce a region-aware contrastive loss that leverages geodesic distances and zero-shot semantic part proposals (e.g., head, leg), injecting structural intent (e.g., "head→head") into the mapping. Extensive experiments on challenging benchmarks show that SeMa3D outperforms existing methods in both extreme non-isometric and inter-class scenarios, achieving strong accuracy and generalization without relying on 3D labels or category-specific training.
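For context, the "functional map formulation" the abstract relies on is the standard spectral alignment step from the shape-matching literature: descriptors on the two shapes are projected onto truncated Laplace–Beltrami eigenbases, a small matrix aligning the two spectral coefficient spaces is solved in least squares, and a vertex-level map is read off by nearest neighbours in the aligned embeddings. The sketch below illustrates that generic step only; all variable names (F, G, Phi_X, Phi_Y) and the specific least-squares convention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a generic functional map step (not the SeMa3D code).
# F: (n_X, d) per-vertex descriptors on shape X; G: (n_Y, d) descriptors on shape Y.
# Phi_X: (n_X, k_X), Phi_Y: (n_Y, k_Y) truncated Laplace-Beltrami eigenbases.
import numpy as np

def functional_map(F, G, Phi_X, Phi_Y):
    """Solve C = argmin ||C A - B||_F^2 with A = Phi_X^+ F and B = Phi_Y^+ G."""
    A = np.linalg.pinv(Phi_X) @ F            # spectral coefficients of descriptors on X, (k_X, d)
    B = np.linalg.pinv(Phi_Y) @ G            # spectral coefficients of descriptors on Y, (k_Y, d)
    # ||C A - B|| = ||A^T C^T - B^T||, so lstsq gives C^T of shape (k_X, k_Y)
    C = np.linalg.lstsq(A.T, B.T, rcond=None)[0].T
    return C                                 # (k_Y, k_X) functional map matrix

def pointwise_map(C, Phi_X, Phi_Y):
    """Recover a vertex-level map Y -> X by nearest neighbours in the aligned spectral embeddings."""
    emb_Y = Phi_Y @ C                        # (n_Y, k_X): Y vertices expressed in X's basis
    emb_X = Phi_X                            # (n_X, k_X)
    d2 = ((emb_Y[:, None, :] - emb_X[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                 # for each Y vertex, the index of its matching X vertex
```

In this convention the descriptors (here, the fused semantic and geometric features) drive the estimate of C, so richer vertex-level descriptors directly translate into a better-conditioned least-squares problem and a smoother recovered map.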
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7677