Keywords: Multimodal AI Agents, NLP Applications
Abstract: Earth Observation (EO) uses satellite and aerial imagery to monitor the Earth’s surface, supporting critical applications in infrastructure, agriculture, and climate change. As governments and industry scale EO pipelines, reliable automation has become essential. Yet current vision-language models (VLMs) are limited to coarse-grained perception and struggle to execute the precise, multi-step reasoning required for operational decision-making. Recent evaluations on benchmarks such as GeoBench-VLM highlight this shortcoming: even state-of-the-art models show low accuracy on tasks that require precise numerical reasoning and domain-specific knowledge, such as object counting, crop-type classification, and vegetation health assessment. These limitations stem from a static, monolithic inference pipeline that prevents adaptive analysis and error correction. To address them, we introduce \textit{GeoScout-Agent}, an autonomous agentic framework that couples GPT-5-mini’s tool-use capabilities with specialized vision models. Built on LangChain, the system dynamically invokes code execution, progressive zooming, sharpening, and external context verification, while DINOv3 and SAM3 provide zero-shot segmentation and high-quality feature extraction for richer LLM context. This coordinated framework enables the model to iteratively decompose, validate, and refine its predictions rather than relying on a single forward pass. Evaluated on GeoBench-VLM, our approach achieves substantial gains over standard VLM baselines: \textit{GeoScout-Agent} consistently resolves intermediate failures, improves geospatial understanding, and achieves a relative $17.3\%$ improvement over the baseline across the evaluated tasks. We will publicly release our code upon acceptance.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: NLP Applications, Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2670