Realizing LLMs’ Causal Potential Requires Science-Grounded, Novel Benchmarks

Published: 23 Sept 2025, Last Modified: 18 Oct 2025, NeurIPS 2025 Workshop CauScien Poster, License: CC BY 4.0
Keywords: Causal Discovery, LLM, PC
TL;DR: LLM evaluations on popular causal discovery benchmarks are flawed; we need science-grounded, novel graphs and hybrid methods that pair LLM knowledge with data-based approaches.
Abstract: Recent claims of strong performance by Large Language Models (LLMs) on causal discovery tasks are undermined by a critical flaw: many evaluations rely on benchmarks likely included in LLMs’ pretraining data, raising concerns that apparent success reflects memorization rather than genuine reasoning. This risks creating a misleading narrative that LLM-only methods, which ignore observational data, outperform classical statistical approaches. We challenge this view by asking whether LLMs truly reason about causal structure, how such reasoning can be measured reliably without leakage, and whether LLMs can be trusted for causal discovery in real scientific domains. We argue that realizing their potential for accelerating scientific discovery requires two shifts: developing robust evaluation protocols based on recent, unseen scientific studies to avoid dataset leakage, and designing hybrid methods that combine LLM-derived world knowledge with statistical approaches. To this end, we outline a practical recipe for constructing causal graphs from post-training scientific publications, ensuring evaluations remain leakage-free while encompassing both established and novel causal relationships.
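As a concrete illustration of the hybrid direction mentioned in the abstract, the sketch below constrains the PC algorithm with LLM-derived edge constraints. It is a minimal sketch, not the paper's method: it assumes the causal-learn package (`pip install causal-learn`) and its `pc`/`BackgroundKnowledge`/`GraphNode` interface, and `llm_forbidden_edges` is a hypothetical placeholder for the LLM query that returns a fixed answer so the example runs end to end.

```python
# Hybrid sketch: LLM-derived background knowledge steering the PC algorithm.
# Assumptions: causal-learn's pc() accepts node_names and background_knowledge;
# llm_forbidden_edges() stands in for a real LLM prompt about implausible edges.
import numpy as np
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.PCUtils.BackgroundKnowledge import BackgroundKnowledge
from causallearn.graph.GraphNode import GraphNode


def llm_forbidden_edges(variables):
    """Hypothetical placeholder: ask an LLM which directed edges are implausible.

    A real pipeline would prompt the model with domain context and parse its
    answer; here a fixed pair is returned so the sketch is runnable.
    """
    return [("outcome", "treatment")]  # e.g., the outcome cannot cause the treatment


def hybrid_pc(data, variables, alpha=0.05):
    """Run PC on observational data, constrained by LLM-derived world knowledge."""
    bk = BackgroundKnowledge()
    for cause, effect in llm_forbidden_edges(variables):
        bk.add_forbidden_by_node(GraphNode(cause), GraphNode(effect))
    # PC with the default Fisher-z independence test for continuous data.
    return pc(data, alpha=alpha, node_names=variables, background_knowledge=bk)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    treatment = rng.normal(size=500)
    outcome = 2.0 * treatment + rng.normal(size=500)
    covariate = rng.normal(size=500)
    data = np.column_stack([treatment, outcome, covariate])
    cg = hybrid_pc(data, ["treatment", "outcome", "covariate"])
    print(cg.G)  # estimated CPDAG with the forbidden direction ruled out
```

The same skeleton extends naturally to the leakage-free evaluation recipe: the reference graph would instead be extracted from post-cutoff publications and compared against the hybrid output, rather than against benchmarks the model may have memorized.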
Submission Number: 22