An LLM-driven framework for cosmological model-building and exploration

Nayantara Mudur; Carolina Cuesta-Lazaro; Michael W. Toomey; Douglas Finkbeiner

Published: 31 Jul 2025, Last Modified: 17 Aug 2025 | LM4Sci | Everyone | CC BY 4.0

Keywords: research pipeline automation, model discovery

TL;DR: We introduce a framework to assess whether LLM-driven agents can implement and propose cosmological models.

Abstract: Our understanding of how the Universe evolved from its earliest moments to today relies on the existence of dark energy and dark matter, mysterious components that are detectable only through their gravitational effects despite accounting for 95% of the Universe. Recent surveys reveal systematic discrepancies in the temporal evolution of dark energy, potentially pointing toward new physics. Given the success of Large Language Models at completing research-level tasks such as coding and mathematical reasoning, we investigate the capability of LLMs to autonomously propose, implement, and test different cosmological theories. We leverage our framework to challenge Claude Code in three experimental settings: (1) implementing alternative models from curated descriptions by modifying a physics simulation codebase, (2) performing the same implementation directly from research papers, and (3) generating novel hypotheses for dark energy evolution to better explain recent observations. Of the two benchmark models with ground-truth implementations, Claude Code successfully implemented a "Thawing Quintessence" dark energy model but failed to generate correct observables for an "Early Dark Energy" model, even though the code compiled successfully. When working directly from papers rather than curated descriptions, numerical accuracy degraded significantly, though qualitative behavior remained correct. Most remarkably, Claude Code's self-proposed dark energy model achieved a better statistical fit to observations than our standard model, though at the cost of additional parameters.

Submission Number: 25
