Keywords: Earth Observation, AI agent, code generation
Abstract: Earth Observation (EO) provides critical planetary data for environmental monitoring, disaster management, climate science, and other scientific domains.
In this work we ask: Are AI systems ready for reliable Earth Observation?
To answer this, we introduce **UnivEARTH**, a coding benchmark of 408 yes/no questions derived from NASA Earth Observatory articles, spanning 7 topics and over 15 satellite instruments and data sources.
Using the Google Earth Engine API as a tool in a zero-shot setup, LLM agents achieve an accuracy of only 40.0%, with generated code failing to run over 44% of the time.
To better understand LLM agent behavior, we also analyze the impact of using the JavaScript API versus the Python API and the effect of providing documentation. Furthermore, we find that a Reflexion-style framework significantly reduces errors: the accuracies of Claude-4.5-Sonnet, Gemini-2.5-Pro, and GPT-5 rise to around 60%. However, these results remain only marginally above random chance.
Taken together, our findings identify significant challenges that must be solved before AI agents can reliably automate Earth Observation, and suggest paths forward.
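The Reflexion-style error-correction loop mentioned above can be sketched as a generate-execute-revise cycle. The sketch below is illustrative, not the benchmark's actual harness: `generate`, `execute`, and the toy stubs are hypothetical names introduced here, assuming only that failed code's traceback is fed back to the model for revision.

```python
import traceback

def run_with_reflexion(generate, execute, max_attempts=3):
    """Reflexion-style loop: run generated code, and on failure feed
    the error trace back to the generator for a revised attempt.
    `generate(feedback)` returns a code string; `execute(code)` returns
    an answer or raises. Names are illustrative, not the paper's API."""
    feedback = None
    for _ in range(max_attempts):
        code = generate(feedback)
        try:
            return execute(code)
        except Exception:
            feedback = traceback.format_exc()  # verbal self-feedback
    return None  # all attempts failed

# Toy stubs standing in for an LLM and a code sandbox:
def toy_generate(feedback):
    # First draft is buggy; after seeing the traceback, "revise" it.
    return "1/0" if feedback is None else "answer = 'yes'"

def toy_execute(code):
    scope = {}
    exec(code, scope)
    if "answer" not in scope:
        raise RuntimeError("no answer produced")
    return scope["answer"]
```

With these stubs, `run_with_reflexion(toy_generate, toy_execute)` recovers from the failing first attempt and returns `'yes'`, mirroring how traceback feedback can turn a runtime error into a usable answer.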
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Code generation benchmark, Earth Observation, AI4Science
Contribution Types: Data resources
Languages Studied: English
Submission Number: 463