Global Vision-Language Feature Interaction Enhanced by Object-Context Association for Remote Sensing Visual Grounding

Jun Xie, Bing Zhang, Zhengchao Chen, Xuan Yang, Yongqing Bai, Zhaoming Wu, Yue Xu

Published: 2025, Last Modified: 28 May 2026. IEEE Trans. Geosci. Remote. Sens. 2025. License: CC BY-SA 4.0
Abstract: Remote sensing visual grounding (RSVG) aims to accurately localize specific targets in remote sensing (RS) images based on natural language descriptions. However, existing RSVG datasets often contain overly simplistic textual descriptions, exhibit imbalanced object size distributions, and lack semantic connections between targets and their surrounding context. Moreover, current approaches rely on global visual features for region-level localization, while the interaction between visual and textual modalities remains limited. To address these challenges, we propose improvements from both the dataset and algorithm perspectives. First, we explore the semantic correlation between RS objects and their scene context. Based on high-resolution Gaofen satellite imagery, we expand several typical object categories and construct richer textual descriptions that reflect object-background associations. By refining and extending the existing DIOR-RSVG dataset, we build a new dataset named DGF-RSVG. Second, to enhance the semantic alignment between global visual features and textual features, we propose a novel global vision-language multimodal feature interaction enhancement (GME) module. In parallel, we design a local attention enhancement (LAE) module to facilitate fine-grained interaction between object-related textual features and regional visual proposals. These two modules form the foundation of our newly developed detection framework: the global–local attention enhanced detector (GLAED). Extensive experiments show that GLAED achieves state-of-the-art performance on the DGF-RSVG dataset, outperforming the closest competitor by 5.01% in Pr@0.5. It also achieves highly competitive results on the DIOR-RSVG dataset, demonstrating the effectiveness of both our proposed dataset and model framework.
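The abstract reports results in Pr@0.5 without defining it. In visual grounding work this metric is usually read as the fraction of expressions whose predicted box overlaps the ground-truth box with an intersection-over-union (IoU) above 0.5. The sketch below illustrates that common reading only; the box format `(x_min, y_min, x_max, y_max)`, the strict `>` comparison, and the function names `iou` and `precision_at` are assumptions for illustration, not details taken from the paper.

```python
from typing import Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixel coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_at(preds: Sequence[Box], gts: Sequence[Box],
                 threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the paired ground truth
    exceeds `threshold` (one predicted box per expression is assumed)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if iou(p, g) > threshold)
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes: one exact match, one complete miss.
    preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
    gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
    print(precision_at(preds, gts, threshold=0.5))
```

Under this reading, a gain in Pr@0.5 means that more referring expressions were localized with at least moderate overlap; stricter thresholds such as 0.7 or 0.9 would reward tighter boxes.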