GeoAssistant: A Geospatial Vision and Language Assistant that Plugs and Learns to Use Tools for Remote Sensing

ICLR 2026 Conference Submission 22510 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: CV: Remote Sensing, Geospatial AI
TL;DR: GeoAssistant is a tool-augmented, multimodal assistant that autonomously uses external tools to analyze both optical and Synthetic Aperture Radar imagery. It demonstrates strong performance across a range of remote sensing tasks.
Abstract: Vision-language models (VLMs) hold great potential for interpreting large-scale remote sensing (RS) archives, which are critical for applications such as environmental monitoring, disaster response, and urban planning. However, general-purpose VLMs perform poorly on RS tasks, and existing RS-specific VLMs still struggle with fine-grained understanding and focus primarily on optical imagery. To address these limitations, we propose GeoAssistant, a tool-augmented multimodal assistant tailored for RS scenarios. GeoAssistant interprets user instructions, autonomously determines whether to invoke external tools, and synthesizes their outputs to generate precise responses. A key innovation of our approach is its capability to process both optical and Synthetic Aperture Radar (SAR) imagery, enabling a wide range of tasks, including visual grounding, object detection, segmentation, and multifaceted reasoning. To support this, we construct the first cross-domain, tool-augmented instruction dataset for RS, addressing the critical challenge of task-specific data scarcity. We also introduce GeoAssistBench, a comprehensive benchmark for cross-domain, multi-task dialogue in RS, and use it to evaluate GeoAssistant. Our results show that GeoAssistant consistently outperforms existing RS-specific VLMs across diverse tasks, demonstrating its practical value for real-world RS applications.
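
To make the abstract's interpret-decide-synthesize loop concrete, here is a minimal Python sketch of a tool-augmented assistant. All names here (`ToolCall`, `detect_objects`, `segment_image`, the keyword-based `plan` heuristic) are illustrative assumptions, not GeoAssistant's actual interface; in the real system the VLM itself decides whether and how to invoke a tool.

```python
# Hypothetical sketch of the tool-invocation loop described in the abstract.
# Tool names and the dispatch heuristic are assumptions for illustration only.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    arguments: dict


# Registry of external tools the assistant may invoke (illustrative stubs).
TOOLS: dict[str, Callable[..., str]] = {
    "detect_objects": lambda image, classes: f"boxes for {classes} in {image}",
    "segment_image": lambda image, prompt: f"mask for '{prompt}' in {image}",
}


def plan(instruction: str, image: str) -> Optional[ToolCall]:
    """Stand-in for the VLM deciding whether a tool is needed.

    In the real system this decision is made by the model; a keyword
    heuristic keeps this sketch self-contained and runnable.
    """
    if "detect" in instruction:
        return ToolCall("detect_objects", {"image": image, "classes": "ships"})
    if "segment" in instruction:
        return ToolCall("segment_image", {"image": image, "prompt": instruction})
    return None  # answer directly from the VLM; no tool required


def answer(instruction: str, image: str) -> str:
    call = plan(instruction, image)
    if call is None:
        return f"direct VLM answer to: {instruction}"
    # Invoke the chosen tool and synthesize its output into the response.
    tool_output = TOOLS[call.name](**call.arguments)
    return f"synthesized response using {call.name}: {tool_output}"


if __name__ == "__main__":
    print(answer("detect all ships", "sar_scene.tif"))
    print(answer("describe the scene", "optical_scene.tif"))
```

The key design point the sketch mirrors is that tool use is conditional: the assistant answers simple queries directly and only routes fine-grained tasks (detection, segmentation, grounding) through specialist tools, then folds their outputs into the final reply.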
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22510