When Vision Needs a Second Look: Tool-Augmented Active Perception for Earth Observation

ACL ARR 2026 January Submission 2670 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0

Keywords: Multimodal AI Agents, NLP Applications

Abstract: Earth Observation (EO) uses satellite and aerial imagery to monitor the Earth’s surface, supporting critical applications in infrastructure, agriculture, and climate change. As governments and industry scale EO pipelines, reliable automation has become essential. Yet current Vision–Language Models (VLMs) are limited to coarse-grained perception and struggle to execute the precise, multi-step reasoning required for operational decision-making. Recent evaluations on benchmarks such as GeoBench-VLM highlight this shortcoming: even state-of-the-art models show low accuracy on tasks that require precise numerical reasoning and domain-specific knowledge, such as object counting, crop-type classification, and vegetation-health assessment. These limitations stem from a static, monolithic inference pipeline, which prevents adaptive analysis and error correction. To address them, we introduce \textit{GeoScout-Agent}, an autonomous agentic framework that couples GPT-5-mini’s tool-calling capabilities with a set of perception and verification tools. Built on LangChain, the system dynamically invokes code execution, progressive zooming, sharpening, and external context verification, while DINOv3 and SAM3 provide high-quality feature extraction and zero-shot segmentation for richer LLM context. This coordinated framework lets the model iteratively decompose, validate, and refine its predictions rather than rely on a single forward pass. Evaluated on GeoBench-VLM, our approach achieves substantial gains over standard VLM baselines: \textit{GeoScout-Agent} consistently resolves intermediate failures, improves geospatial understanding, and achieves a relative $17.3\%$ improvement over the baseline approach across the evaluated tasks. We will publicly release our code upon acceptance.
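
The iterative tool loop described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation: the names `run_agent`, `toy_policy`, `center_zoom`, and `stretch_contrast` are hypothetical, the image is a toy grid of integers, the "sharpen" tool is a simple contrast stretch standing in for real sharpening, and the rule-based policy stands in for the LLM that would choose tools in the actual system.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities


def zoom(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Crop the image to box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


def center_zoom(image: Image) -> Image:
    """One progressive-zoom step: keep the central half of the image."""
    h, w = len(image), len(image[0])
    return zoom(image, (h // 4, w // 4, h - h // 4, w - w // 4))


def stretch_contrast(image: Image) -> Image:
    """Stand-in for sharpening: rescale intensities to the 0..255 range."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return image  # flat image, nothing to stretch
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in image]


# Hypothetical tool registry; the real system exposes tools via LangChain.
TOOLS: Dict[str, Callable[[Image], Image]] = {
    "zoom": center_zoom,
    "sharpen": stretch_contrast,
}


def run_agent(
    image: Image,
    propose: Callable[[Image, List[str]], str],
    max_steps: int = 4,
) -> Tuple[Image, List[str]]:
    """Apply tools chosen by `propose` until it names a non-tool action."""
    trace: List[str] = []
    for _ in range(max_steps):
        action = propose(image, trace)
        if action not in TOOLS:
            break  # e.g. "answer": the policy is satisfied with the view
        image = TOOLS[action](image)
        trace.append(action)
    return image, trace


def toy_policy(image: Image, trace: List[str]) -> str:
    """Rule-based stand-in for the LLM: zoom, then sharpen once, then answer."""
    if len(image) > 4:
        return "zoom"
    if "sharpen" not in trace:
        return "sharpen"
    return "answer"


if __name__ == "__main__":
    scene = [[r * 8 + c for c in range(8)] for r in range(8)]
    view, steps = run_agent(scene, toy_policy)
    print(steps, len(view), len(view[0]))
```

The step cap (`max_steps`) is an assumption of this sketch; it bounds the loop so that a policy which never settles on an answer still terminates.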

Paper Type: Long

Research Area: AI/LLM Agents

Research Area Keywords: NLP Applications, Multimodality and Language Grounding to Vision, Robotics and Beyond

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 2670
