Learning contrastive semantic decomposition for visual grounding

Published: 2025 · Last Modified: 05 Jan 2026 · Neural Networks 2025 · License: CC BY-SA 4.0
Abstract: Visual grounding requires accurately locating and identifying the objects or regions of an image that are described by a natural language expression. Current research predominantly uses independent encoders to extract visual and textual features separately, then applies fusion encoders to integrate the multimodal information. However, the independent encoders ignore attributes shared across modalities, which can lead to inconsistency in multimodal fusion. Moreover, the fusion encoders may struggle to distinguish overlapping and non-contributing features from the two modalities, resulting in redundant fused information. To tackle these challenges, we propose a novel Contrastive Semantic Decomposition network for Visual Grounding (CSDVG) that effectively decomposes semantic features into shared and modality-specific components and models cross-modality features. CSDVG comprises two key components: an associated semantic branch that identifies shared, query-relevant features and an independent semantic branch that isolates modality-specific, query-irrelevant information. To further facilitate learning these contrasting features, we propose a relevance-driven loss function that balances shared and specific features with respect to the query. In comprehensive experiments, CSDVG outperforms current approaches on all evaluated datasets.
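To make the two-branch idea concrete, the following is a minimal illustrative sketch, not the authors' implementation: it assumes simple linear projections for the associated (shared) and independent (specific) branches, and substitutes an InfoNCE-style alignment term plus a shared/specific decorrelation term for the paper's relevance-driven loss. All module names, dimensions, and hyperparameters here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticDecomposition(nn.Module):
    """Illustrative two-branch decomposition: an 'associated' branch that
    projects both modalities into a shared space, and an 'independent'
    branch that keeps modality-specific components. Hypothetical sketch."""

    def __init__(self, vis_dim=256, txt_dim=256, dim=256):
        super().__init__()
        # Associated branch: one projection per modality into a common space.
        self.shared_vis = nn.Linear(vis_dim, dim)
        self.shared_txt = nn.Linear(txt_dim, dim)
        # Independent branch: separate modality-specific projections.
        self.spec_vis = nn.Linear(vis_dim, dim)
        self.spec_txt = nn.Linear(txt_dim, dim)

    def forward(self, v, t):
        # v: (B, vis_dim) pooled visual features; t: (B, txt_dim) pooled text features.
        return (self.shared_vis(v), self.shared_txt(t),
                self.spec_vis(v), self.spec_txt(t))


def relevance_driven_loss(sv, st, pv, pt, tau=0.07):
    """Toy stand-in for the relevance-driven loss: contrastively align the
    shared features across modalities, and push each modality's specific
    features away from its shared ones."""
    sv, st = F.normalize(sv, dim=-1), F.normalize(st, dim=-1)
    # InfoNCE-style alignment: matching image/text pairs lie on the diagonal.
    logits = sv @ st.t() / tau
    labels = torch.arange(sv.size(0), device=sv.device)
    align = 0.5 * (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.t(), labels))
    # Decorrelation: drive shared/specific cosine similarity toward zero.
    sep = (F.cosine_similarity(sv, F.normalize(pv, dim=-1)).abs().mean() +
           F.cosine_similarity(st, F.normalize(pt, dim=-1)).abs().mean())
    return align + sep


# Usage on random tensors standing in for encoder outputs.
model = SemanticDecomposition()
v = torch.randn(8, 256)   # batch of pooled visual features
t = torch.randn(8, 256)   # batch of pooled text features
loss = relevance_driven_loss(*model(v, t))
loss.backward()
```

In this sketch the alignment term plays the role of enforcing consistency between shared features of the two modalities, while the decorrelation term keeps the independent branch from duplicating what the associated branch already captures; the actual relevance weighting with respect to the query is specific to the paper and is not reproduced here.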