3D-ATRES: Ambiguity-Tolerant Learning for 3D Referring Expression Segmentation

05 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: 3D Referring Expression Segmentation, Ambiguity
Abstract: 3D Referring Expression Segmentation (3D-RES) is an emerging yet challenging task at the intersection of 3D vision and language, which aims to precisely segment a target instance within a 3D point cloud based on a given natural language referring expression. However, most previous methods overlook the multi-source ambiguities that are prevalent in real-world scenarios: prompt, spatial, and annotation ambiguities. Prompt ambiguity arises from confusion between referent and target instances due to ambiguous language, spatial ambiguity results from viewpoint variations that cause incomplete segmentation, and annotation ambiguity stems from inconsistent or noisy labeling in the training data. In this paper, we propose 3D Ambiguity-Tolerant Referring Expression Segmentation (3D-ATRES), a novel framework that explicitly models and mitigates multi-source ambiguities in 3D-RES. Specifically, we employ a $\text{TR}^{2}$ Semantic Structurizer that transforms free-form natural language into structured Target-Relation-Referent triples, thereby eliminating prompt ambiguity. For spatial ambiguity, we introduce a Normal-Aware Spatial Alignment module that leverages surface normal cues to achieve viewpoint-consistent geometry alignment. To mitigate annotation ambiguity, we design an Annotation Ambiguity Penalty, which enables the network to adaptively learn from noisy or inconsistent annotations through confidence evaluation. Experiments on ScanRefer and Multi3DRefer show that 3D-ATRES achieves state-of-the-art performance, confirming the effectiveness of explicitly modeling ambiguity in 3D-RES.
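The abstract describes its three mechanisms only at a high level; the short Python sketches below illustrate one plausible reading of each. All function and variable names are hypothetical, introduced for illustration, and none of this is the authors' actual implementation.

First, the $\text{TR}^{2}$ Semantic Structurizer maps a free-form expression to a Target-Relation-Referent triple. A minimal sketch, assuming a fixed relation vocabulary and simple keyword matching (the paper's structurizer is presumably a learned component, not a rule-based parser):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical relation vocabulary for this toy keyword matcher.
RELATIONS = ("next to", "in front of", "behind", "on top of", "under", "near")

@dataclass
class TRRTriple:
    target: str    # the instance to segment, e.g. "chair"
    relation: str  # the spatial relation, e.g. "next to"
    referent: str  # the anchor instance, e.g. "table"

def parse_expression(expr: str) -> Optional[TRRTriple]:
    """Split an expression around the first known relation phrase."""
    def drop_det(s: str) -> str:
        return s.strip().removeprefix("the ").strip()

    text = expr.lower().strip()
    for rel in RELATIONS:
        if f" {rel} " in text:
            left, right = text.split(f" {rel} ", 1)
            return TRRTriple(drop_det(left), rel, drop_det(right))
    return None

print(parse_expression("the chair next to the table"))
# -> TRRTriple(target='chair', relation='next to', referent='table')
```

Second, the Normal-Aware Spatial Alignment module relies on surface normal cues. One standard way to obtain such cues, sketched here, is per-point normal estimation via PCA over local neighborhoods; whether the paper estimates normals this way or consumes precomputed ones is not stated in the abstract:

```python
import torch
import torch.nn.functional as F

def estimate_normals(xyz: torch.Tensor, k: int = 16) -> torch.Tensor:
    """Per-point unit normals via PCA over k-nearest neighbors.

    xyz: (N, 3) point coordinates. Returns (N, 3) normals, each the
    eigenvector of the local covariance with the smallest eigenvalue.
    """
    dists = torch.cdist(xyz, xyz)               # (N, N) pairwise distances
    knn = dists.topk(k, largest=False).indices  # (N, k) neighbor indices
    neigh = xyz[knn]                            # (N, k, 3) local patches
    centered = neigh - neigh.mean(dim=1, keepdim=True)
    cov = centered.transpose(1, 2) @ centered / k   # (N, 3, 3) covariance
    _, eigvecs = torch.linalg.eigh(cov)         # eigenvalues in ascending order
    normals = eigvecs[..., 0]                   # smallest-eigenvalue direction
    return F.normalize(normals, dim=-1)
```

Third, the Annotation Ambiguity Penalty lets the network learn adaptively from noisy labels through confidence evaluation. A minimal sketch of a confidence-weighted loss in that spirit, with a log-penalty that keeps the predicted confidence from collapsing to zero (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def ambiguity_penalty_loss(logits: torch.Tensor,
                           labels: torch.Tensor,
                           conf_logits: torch.Tensor,
                           reg_weight: float = 0.1) -> torch.Tensor:
    """Confidence-weighted segmentation loss (illustrative only).

    logits, labels: (N,) per-point mask logits and {0, 1} targets.
    conf_logits:    (N,) predicted log-odds of annotation reliability.
    Low-confidence points contribute less to the BCE term; the log
    penalty keeps the model from declaring every annotation unreliable.
    """
    conf = torch.sigmoid(conf_logits)                    # (N,) in (0, 1)
    bce = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")        # (N,) per-point loss
    penalty = -reg_weight * torch.log(conf + 1e-8)       # discourages conf -> 0
    return (conf * bce + penalty).mean()
```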
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 2390