When Vision Needs a Second Look: Tool-Augmented Active Perception for Earth Observation

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 · ML4RS @ ICLR 2026 (Main) · CC BY 4.0
Abstract: Earth observation (EO) relies on satellite and aerial imagery, but modern vision-language models (VLMs) remain unreliable on fine-grained, numerically precise geospatial tasks. On GeoBench-VLM, they frequently fail on domain-specific problems such as object counting, crop-type recognition, and vegetation-health assessment, in part because single-pass inference offers no mechanism for targeted inspection or verification. We present GeoScout-Agent, which augments GPT-5-mini with tool-based active perception. Built on LangChain, the agent iteratively invokes zooming/sharpening, code execution, retrieval, and vision modules (DINOv3 and SAM3) to verify intermediate hypotheses and refine predictions. Across GeoBench-VLM, GeoScout-Agent improves performance by $17.3\%$ (relative) over the tool-free baseline.
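The abstract describes an iterative loop in which the agent invokes tools (zoom/sharpen, code execution, retrieval, vision modules) to verify intermediate hypotheses before committing to an answer. A minimal sketch of such an active-perception loop is shown below; the tool names (`zoom`, `count_objects`), the state dictionary, and the verification-based stopping rule are illustrative assumptions, not the paper's actual implementation or LangChain's API.

```python
# Hypothetical sketch of a tool-augmented active-perception loop.
# Tool names and the stopping rule are illustrative assumptions; the
# paper's agent is built on LangChain with GPT-5-mini as the planner.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PerceptionAgent:
    """Iteratively invokes tools to refine a hypothesis about an image."""
    tools: Dict[str, Callable[[dict], dict]]
    max_steps: int = 4
    trace: List[str] = field(default_factory=list)

    def run(self, state: dict, plan: List[Tuple[str, str]]) -> dict:
        # plan: ordered (tool_name, purpose) pairs chosen by a planner model
        for tool_name, purpose in plan[: self.max_steps]:
            state = self.tools[tool_name](state)
            self.trace.append(f"{tool_name}: {purpose}")
            if state.get("verified"):  # stop once the hypothesis checks out
                return state
        return state


# Toy tools standing in for zooming and a counting module.
def zoom(state: dict) -> dict:
    state["resolution"] = state.get("resolution", 1) * 2
    return state


def count_objects(state: dict) -> dict:
    # pretend the counter is reliable only at >= 4x resolution
    state["count"] = 12 if state["resolution"] >= 4 else None
    state["verified"] = state["count"] is not None
    return state


agent = PerceptionAgent(tools={"zoom": zoom, "count": count_objects})
result = agent.run(
    {"resolution": 1},
    plan=[
        ("zoom", "inspect region of interest"),
        ("zoom", "inspect region of interest"),
        ("count", "count vehicles in region"),
    ],
)
print(result["count"])  # → 12
```

The key design point the abstract highlights is that each tool call feeds back into the agent's state, so a failed verification (here, `count is None` at low resolution) triggers further inspection instead of a single-pass guess.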
Submission Number: 7