When AI Can’t Reproduce the Planet: Reproducibility Drift in Environmental AI Systems

Published: 27 Jan 2026, Last Modified: 27 Jan 2026 · AAAI 2026 AI4ES Oral · CC BY 4.0
Keywords: environmental AI, reproducibility, large language models, data drift, LLaMA models, environmental datasets, scientific reliability, AI reproducibility audit, multi-turn evaluation, meta-reproducibility
TL;DR: Benchmarking how Large Language Models lose environmental data references across turns—revealing reproducibility drift in AI-generated environmental research.
Abstract: Reproducibility is a cornerstone of scientific reliability, yet today’s AI assistants often fail this test themselves. Large Language Models (LLMs) are increasingly used for environmental analyses and data references, but their ability to maintain consistent data references across multi-turn conversations remains largely unexplored. This study introduces the concept of environmental reproducibility drift—the phenomenon in which environmental data references mutate, disappear, or are fabricated during extended LLM interactions. Through a comprehensive analysis of 240 conversations across 4 LLaMA models using 36 authentic environmental datasets from 6 domains, this work demonstrates significant data reference instability. Results reveal that environmental reference stability varies dramatically across models, with LLaMA-4-Maverick-17B showing the highest stability (0.481) and LLaMA-4-Scout-17B showing the worst fabrication rate (0.856). This study introduces novel metrics, including environmental drift entropy and willingness-to-reference data, providing a framework for evaluating LLM data reference reliability in environmental contexts. We frame environmental reproducibility drift as a meta-reproducibility benchmark revealing that LLMs cannot consistently reproduce their own environmental outputs. Instability in reproduced environmental outputs threatens policy communication and risk assessment (e.g., flood or wildfire warnings). Our benchmark offers a practical reliability audit for environmental AI tools prior to deployment.
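To make the drift-entropy idea concrete, here is a minimal sketch. It assumes drift entropy is the Shannon entropy of the distribution of distinct dataset references a model emits across conversation turns; the paper's exact definition is not given in this abstract, so treat the formula, function name, and example dataset names as illustrative assumptions only.

```python
from collections import Counter
from math import log2

def drift_entropy(references):
    """Hypothetical sketch of environmental drift entropy: Shannon entropy
    (in bits) over the dataset references an LLM cites across turns. A
    perfectly stable model repeats one reference (entropy 0); a drifting
    model spreads mass over mutated or fabricated references (entropy > 0)."""
    counts = Counter(references)
    n = len(references)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# One simulated conversation: the model cites the same dataset on the first
# two turns, then drifts to a mutated version label and a fabricated name.
# (Dataset strings here are made up for illustration.)
turns = ["NOAA GHCN v4", "NOAA GHCN v4", "NOAA GHCN v3", "Global Climate Archive 2021"]
print(round(drift_entropy(turns), 3))  # higher values indicate more reference drift
```

Under this assumed definition, a fully stable four-turn conversation scores 0.0, while the drifting example above scores 1.5 bits; per-model averages over many conversations would then support the kind of stability ranking the abstract reports.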
Submission Number: 28