Attribute-Relation Guided Compositional Alignment for Weakly Supervised Referring Expression Comprehension

Lian Xu; Mohammed Bennamoun; Farid Boussaid; Hamid Laga; Yulan Guo; Dan Xu

17 Sept 2025 (modified: 14 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Referring Expression Comprehension, weakly supervised learning, weakly supervised referring expression comprehension

Abstract: Referring expression comprehension (REC) aims to localize the object in an image that a natural-language expression describes. Referring expressions often specify objects through diverse attributes and structured relations, but weakly supervised models often reduce these rich linguistic cues to coarse anchor features extracted from pre-trained detectors. The asymmetry between the expressive power of language and the limited granularity of visual features remains the core challenge for weakly supervised REC (WREC). Existing methods attempt to enrich anchors with auxiliary cues, but these cues cannot capture diverse attributes or consistently improve instance distinctiveness. More importantly, these methods align text with individual anchors, which are unstructured representations unable to encode relational semantics. Capturing such structural cues requires explicitly modeling interactions among anchors. To address these limitations, we propose the Attribute-Relation guided Compositional Alignment (ARCA) framework. ARCA consists of two key components: (i) an attribute enhancer that introduces learnable attribute prototypes and, guided by subject noun chunks (e.g., "a small wooden chair"), enables anchors to cover diverse attribute semantics; and (ii) a relation encoder that models inter-anchor relation representations and aligns them with full-sentence embeddings, enabling the capture of structured relational cues. Together, these two components establish a compositional alignment mechanism that enables the visual features to better match the richness and structure of language. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that ARCA achieves state-of-the-art performance, demonstrating the effectiveness of compositional alignment for WREC.

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 8958
