Keywords: Remote Sensing Image-Text Retrieval
Abstract: Remote sensing image-text retrieval (RSITR) addresses the bidirectional retrieval problem between images and text in large-scale remote sensing databases. Despite significant progress, the pronounced inter-class similarity and multi-scale characteristics inherent in remote sensing (RS) imagery mean that current research relying on raw descriptive text often suffers from ambiguous semantics and vague user intent, thereby limiting its generalizability in real-world scenarios. To overcome these challenges, this work leverages the power of multimodal large language models (MLLMs) and introduces DiaRet, a novel dialogue-driven cross-modal retrieval framework for RSITR. DiaRet starts from a user-given query and induces multi-level semantic concepts to construct a comprehensive and deterministic understanding of the scene. To this end, our method engages in a context-aware question-answer interaction that progressively clarifies vague user intent. Furthermore, we introduce an LLM-based fine-grained attribute reasoning module that distills the dialogue into a structured formalism of atomic editing instructions and critical visual keywords, enabling targeted optimization and sharpening the focus on discriminative visual details. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that DiaRet achieves state-of-the-art performance, validating the superiority of our interactive, dynamic dialogue approach for accurate RSITR.
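To make the described pipeline concrete, below is a minimal Python sketch of a dialogue-driven retrieval loop in the spirit of the abstract: a query is refined over several question-answer rounds, distilled into visual keywords, and then matched against image embeddings. All callables (ask_question, get_answer, distill, encode_text) are hypothetical placeholders standing in for the MLLM, LLM, and text-encoder components; this is not the paper's actual implementation.

```python
import numpy as np

def dialogue_driven_retrieval(initial_query, image_embeddings,
                              ask_question, get_answer, distill, encode_text,
                              rounds=3):
    """Sketch: refine a vague query via simulated Q&A, then rank images.

    ask_question, get_answer, distill, and encode_text are caller-supplied
    callables standing in for the MLLM / LLM / text-encoder components.
    """
    dialogue = [initial_query]
    for _ in range(rounds):
        question = ask_question(dialogue)   # MLLM poses a context-aware clarifying question
        answer = get_answer(question)       # user (or a simulator) answers it
        dialogue.extend([question, answer])

    # Distill the dialogue into editing instructions and critical visual keywords.
    instructions, keywords = distill(dialogue)

    # Rank images by cosine similarity to the refined query embedding.
    query_vec = encode_text(" ".join([initial_query, *keywords]))
    sims = image_embeddings @ query_vec / (
        np.linalg.norm(image_embeddings, axis=1) * np.linalg.norm(query_vec) + 1e-8
    )
    return np.argsort(-sims)  # image indices, most relevant first
```

In practice the returned `instructions` would drive the targeted optimization mentioned in the abstract; here they are simply exposed for illustration.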
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 10127