When Vision Needs a Second Look: Tool-Augmented Active Perception for Earth Observation

Published: 01 Mar 2026, Last Modified: 01 Mar 2026 · ML4RS @ ICLR 2026 (Main) · CC BY 4.0
Abstract: Earth observation (EO) relies on satellite and aerial imagery, but modern vision-language models (VLMs) remain unreliable on fine-grained, numerically precise geospatial tasks. On GeoBench-VLM, they frequently fail on domain-specific problems such as object counting, crop-type recognition, and vegetation-health assessment, in part because single-pass inference offers no mechanism for targeted inspection or verification. We present GeoScout-Agent, which augments GPT-5-mini with tool-based active perception. Built on LangChain, the agent iteratively invokes zooming/sharpening, code execution, retrieval, and vision modules (DINOv3 and SAM3) to verify intermediate hypotheses and refine predictions. Across GeoBench-VLM, GeoScout-Agent improves performance by $17.3\%$ (relative) over the tool-free baseline.
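The abstract describes an iterative loop in which the agent invokes tools (zoom/sharpen, code execution, retrieval, vision modules) to verify intermediate hypotheses before committing to an answer. A minimal sketch of such an active-perception loop is shown below; the tool names (`zoom`, `count_objects`), the state dictionary, and the verification-based stopping rule are illustrative assumptions, not the paper's actual implementation or LangChain's API.

```python
# Hypothetical sketch of a tool-augmented active-perception loop.
# Tool names and the stopping rule are illustrative assumptions; the
# paper's agent is built on LangChain with GPT-5-mini as the planner.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class PerceptionAgent:
    """Iteratively invokes tools to refine a hypothesis about an image."""
    tools: Dict[str, Callable[[dict], dict]]
    max_steps: int = 4
    trace: List[str] = field(default_factory=list)

    def run(self, state: dict, plan: List[Tuple[str, str]]) -> dict:
        # plan: ordered (tool_name, purpose) pairs chosen by a planner model
        for tool_name, purpose in plan[: self.max_steps]:
            state = self.tools[tool_name](state)
            self.trace.append(f"{tool_name}: {purpose}")
            if state.get("verified"):  # stop once the hypothesis checks out
                return state
        return state


# Toy tools standing in for zooming and a counting module.
def zoom(state: dict) -> dict:
    state["resolution"] = state.get("resolution", 1) * 2
    return state


def count_objects(state: dict) -> dict:
    # pretend the counter is reliable only at >= 4x resolution
    state["count"] = 12 if state["resolution"] >= 4 else None
    state["verified"] = state["count"] is not None
    return state


agent = PerceptionAgent(tools={"zoom": zoom, "count": count_objects})
result = agent.run(
    {"resolution": 1},
    plan=[
        ("zoom", "inspect region of interest"),
        ("zoom", "inspect region of interest"),
        ("count", "count vehicles in region"),
    ],
)
print(result["count"])  # → 12
```

The key design point the abstract highlights is that each tool call feeds back into the agent's state, so a failed verification (here, `count is None` at low resolution) triggers further inspection instead of a single-pass guess.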
Submission Number: 7