Seeing with Words: Interpretable Language-Guided Drone Geo-localization via LLM-Enriched Semantic Attribute Alignment

Changsen Yuan, Yang-Hao Zhou, Cunhan Guo, Danjie Han, Ge Shi, Wenwu Wang

Published: 01 Jan 2025, Last Modified: 08 Jan 2026, IEEE Transactions on Multimedia, CC BY-SA 4.0
Abstract: Natural language-guided drone geo-localization (DGL) provides an intuitive and scalable mode of human-drone interaction for tasks such as search, rescue, and surveillance. Recent Vision-Language Models (VLMs) can learn semantic correspondences between text and images during fine-tuning. However, their performance in DGL tasks remains constrained, as complex instructions and cluttered scenes often cause semantic dilution and granularity mismatch, leading to weak cross-modal alignment. Consequently, the models struggle with ambiguous targets and suffer from reduced localization accuracy. To address these challenges, we propose SAA-DGL, a framework for interpretable language-guided Drone Geo-Localization that enriches Semantic Attribute Alignment (SAA) with large language models (LLMs). It introduces two parameter-free cross-modal fusion modules: (1) the LLM-driven Cross-modal Semantic Attribute Enrichment (LCSAE) module, which extracts fine-grained attributes (e.g., color, shape, position) from text and embeds them into visual features as explicit semantic anchors, producing semantically enriched cross-modal representations; and (2) the Bidirectional Feature Alignment (BFA) module, which builds fusion relationships between visual and textual features via similarity-driven mechanisms, enabling effective integration of enriched visual and textual information. This design improves cross-modal consistency and interpretability while preserving pretrained alignment priors and enhancing training stability. Experiments on the GeoText-1652 benchmark show that SAA-DGL achieves state-of-the-art performance and strong robustness under complex visual and linguistic disturbances, validating its effectiveness for challenging geo-localization scenarios. We will release the code.
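To make the idea of a parameter-free, similarity-driven fusion step concrete, the sketch below illustrates one plausible reading of the BFA module described in the abstract. It is not the authors' implementation: the residual addition, the softmax over cosine similarities, and the temperature value are assumptions introduced only for illustration; the abstract states only that the module is parameter-free and similarity-driven.

```python
import torch
import torch.nn.functional as F

def bidirectional_feature_alignment(visual, textual, temperature=0.07):
    """Illustrative, parameter-free bidirectional cross-modal fusion (assumed form).

    visual:  (N_v, D) patch/region features from a vision encoder
    textual: (N_t, D) token features from a text encoder
    Returns fused visual and textual features with the same shapes.
    """
    # L2-normalise so dot products become cosine similarities
    v = F.normalize(visual, dim=-1)
    t = F.normalize(textual, dim=-1)

    # Cross-modal similarity matrix of shape (N_v, N_t); no learned weights
    sim = v @ t.t() / temperature

    # Each visual feature aggregates the textual tokens most similar to it,
    # and vice versa; a residual keeps the original (pretrained) features intact
    visual_fused = visual + F.softmax(sim, dim=-1) @ textual
    textual_fused = textual + F.softmax(sim.t(), dim=-1) @ visual
    return visual_fused, textual_fused
```

Because the fusion introduces no trainable parameters, it leaves the pretrained alignment priors untouched, which is consistent with the training-stability claim in the abstract.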