Keywords: Remote Sensing Image-Text Retrieval
Abstract: Remote sensing image-text retrieval (RSITR) addresses the bidirectional retrieval problem between images and text in large-scale remote sensing databases. Despite significant progress, the pronounced inter-class similarity and multi-scale characteristics inherent in remote sensing (RS) imagery mean that current research relying on raw descriptive text often suffers from ambiguous semantics and vague user intent, thereby limiting its generalizability in real-world scenarios. To overcome these challenges, this work leverages the power of multimodal large language models (MLLMs) and introduces DiaRet, a novel dialogue-driven cross-modal retrieval framework for RSITR. DiaRet starts from a user-given query and induces multi-level semantic concepts to construct a comprehensive and deterministic understanding of the scene. To this end, our method engages in a context-aware question-answer interaction that progressively clarifies vague user intent. Furthermore, we introduce an LLM-based fine-grained attribute reasoning module that distills the dialogue into a structured formalism of atomic editing instructions and critical visual keywords, enabling targeted optimization and sharpening the focus on discriminative visual details. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that DiaRet achieves state-of-the-art performance, validating the superiority of our interactive, dynamic dialogue approach for accurate RSITR.
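To make the described pipeline concrete, below is a minimal Python sketch of a dialogue-driven retrieval loop in the spirit of the abstract: a query is refined over several question-answer rounds, distilled into visual keywords, and then matched against image embeddings. All callables (ask_question, get_answer, distill, encode_text) are hypothetical placeholders standing in for the MLLM, LLM, and text-encoder components; this is not the paper's actual implementation.

```python
import numpy as np

def dialogue_driven_retrieval(initial_query, image_embeddings,
                              ask_question, get_answer, distill, encode_text,
                              rounds=3):
    """Sketch: refine a vague query via simulated Q&A, then rank images.

    ask_question, get_answer, distill, and encode_text are caller-supplied
    callables standing in for the MLLM / LLM / text-encoder components.
    """
    dialogue = [initial_query]
    for _ in range(rounds):
        question = ask_question(dialogue)   # MLLM poses a context-aware clarifying question
        answer = get_answer(question)       # user (or a simulator) answers it
        dialogue.extend([question, answer])

    # Distill the dialogue into editing instructions and critical visual keywords.
    instructions, keywords = distill(dialogue)

    # Rank images by cosine similarity to the refined query embedding.
    query_vec = encode_text(" ".join([initial_query, *keywords]))
    sims = image_embeddings @ query_vec / (
        np.linalg.norm(image_embeddings, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    return np.argsort(-sims)  # image indices, most relevant first
```

In practice the returned `instructions` would drive the targeted optimization mentioned in the abstract; here they are simply exposed for illustration.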
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10127