G-Verifier: Geometric Verifier for Robust 3D Point Cloud Semantic Search with Spatial Relation Reasoning

ICLR 2026 Conference Submission23134 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Spatial Reasoning, Spatial Relationship, 3D Vision-Language, Point Clouds, Representation Learning
Abstract: Semantic search in 3D point clouds is a fundamental task for spatial intelligence and embodied AI, yet it becomes particularly challenging when queries involve precise spatial relationships. Current large-scale vision-language models often falter in these scenarios: their reliance on monolithic, implicit attention mechanisms makes it hard to disentangle semantic attribute matching from complex spatial and geometric constraints, which leads to unreliable localization. To address this issue, we introduce G-Verifier, a geometric verification module that enhances existing 3D visual grounding (3DVG) frameworks by explicitly decoupling semantic attribute matching from spatial reasoning. Our approach realizes a Propose, Select, then Verify paradigm, in which G-Verifier acts as a post-hoc re-ranker that adjudicates semantically filtered candidates based on explicit geometric facts. The core of our module is the Rotary Spatial-Relationship Embedding (RoSE), a structured representation that dynamically fuses high-level object semantics with an explicit 3D geometric encoding. We train this module with a specialized language-alignment strategy on our new large-scale dataset, 3D-SpAn, which contains 285,177 structured spatial relationship annotations. Experiments on a challenging, manually verified benchmark demonstrate the effectiveness of our approach. On its own, the module achieves a high F1-score (0.96) on a relational understanding proxy task, validating its strong discriminative power. When integrated into the end-to-end pipeline, G-Verifier improves grounding accuracy over a strong baseline, increasing Acc@0.50 (+2.50%). Our work validates that a decoupled verification approach is a promising direction for improving the geometric reasoning capabilities of large-scale 3D vision-language models.
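Illustrative sketch (not from the submission): the Select-then-Verify re-ranking described in the abstract could look roughly like the Python below. Every name here (Candidate, select_then_verify, left_of_anchor, top_k) is hypothetical, and the hand-written "left of" rule is only a stand-in for the learned G-Verifier and its RoSE embedding, whose actual interfaces the abstract does not specify. The semantic scores are assumed to come from an existing 3DVG model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Candidate:
    name: str
    center: Vec3            # box center (x, y, z) in a fixed scene frame
    semantic_score: float   # assumed output of the base grounding model


def select_then_verify(
    candidates: Sequence[Candidate],
    verify: Callable[[Candidate], float],
    top_k: int = 3,
) -> Candidate:
    """Keep the top_k candidates by semantic score, then re-rank them
    with a separate geometric score and return the best one."""
    if not candidates:
        raise ValueError("no candidates proposed")
    # Select: semantic filtering only, no geometry involved yet.
    shortlist = sorted(
        candidates, key=lambda c: c.semantic_score, reverse=True
    )[:top_k]
    # Verify: geometry alone decides among the shortlisted candidates.
    return max(shortlist, key=verify)


def left_of_anchor(anchor: Vec3) -> Callable[[Candidate], float]:
    """Toy geometric check (assumption): a larger score means the
    candidate lies further toward -x than the anchor."""
    def score(c: Candidate) -> float:
        return anchor[0] - c.center[0]
    return score


# Query: "the chair to the left of the table", table assumed at the origin.
proposals = [
    Candidate("chair_a", (2.0, 0.0, 0.0), 0.90),
    Candidate("chair_b", (-1.0, 0.0, 0.0), 0.80),
    Candidate("lamp", (-3.0, 0.0, 0.0), 0.10),
]
best = select_then_verify(proposals, left_of_anchor((0.0, 0.0, 0.0)), top_k=2)
print(best.name)
```

In this toy scene the semantically strongest proposal (chair_a) sits on the wrong side of the anchor, so the geometric step overrides it; the lamp is further left but is filtered out at the Select step, which is the point of decoupling the two stages.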
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 23134