Keywords: Causal Discovery, LLM, PC
TL;DR: LLM evaluations on popular causal discovery benchmarks are flawed; we need science-grounded novel graphs and hybrid methods combining LLMs with data-based approaches
Abstract: Recent claims of strong performance by Large Language Models (LLMs) on causal discovery tasks are undermined by a critical flaw: many evaluations rely on benchmarks likely included in LLMs' pretraining data, raising concerns that apparent success reflects memorization rather than genuine reasoning. This risks creating a misleading narrative that LLM-only methods, which ignore observational data, outperform classical statistical approaches. We challenge this view by asking whether LLMs truly reason about causal structure, how such reasoning can be measured reliably without leakage, and whether LLMs can be trusted for causal discovery in real scientific domains. We argue that realizing their potential for accelerating scientific discovery requires two shifts: developing robust evaluation protocols based on recent, unseen scientific studies to avoid dataset leakage, and designing hybrid methods that combine LLM-derived world knowledge with statistical approaches. To this end, we outline a practical recipe for constructing causal graphs from post-training scientific publications, ensuring evaluations remain leakage-free while encompassing both established and novel causal relationships.
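To make the hybrid idea concrete, the following is a minimal, self-contained sketch (not the submission's actual recipe) of how LLM-derived world knowledge could constrain a data-based search in the spirit of PC. The `llm_implausible_pairs` stub is a hypothetical stand-in for prompting an LLM about each variable pair, and a simple partial-correlation test stands in for full conditional-independence testing.

```python
# Illustrative sketch only: LLM-derived priors prune edges before a
# constraint-based statistical check. Variable names and the LLM stub
# are hypothetical; this is not the paper's method.
import itertools

import numpy as np
from scipy import stats


def partial_corr_pvalue(data, i, j, rest):
    """p-value for corr(X_i, X_j) after regressing out the variables in `rest`."""
    def residual(k):
        Z = np.column_stack([np.ones(len(data))] + [data[:, r] for r in rest])
        beta, *_ = np.linalg.lstsq(Z, data[:, k], rcond=None)
        return data[:, k] - Z @ beta

    r, _ = stats.pearsonr(residual(i), residual(j))
    # Fisher z-transform of the partial correlation coefficient.
    z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(len(data) - len(rest) - 3)
    return 2 * (1 - stats.norm.cdf(abs(z)))


def llm_implausible_pairs(variables):
    """Hypothetical stub for an LLM prompt such as 'Is a direct causal link
    between <A> and <B> scientifically plausible?' for every variable pair."""
    return {frozenset({"noise", "outcome"})}  # assumed LLM judgement, illustration only


def hybrid_skeleton(data, variables, alpha=0.05):
    """Keep an undirected edge only if the LLM prior does not rule it out AND
    the two variables remain dependent given all other variables."""
    implausible = llm_implausible_pairs(variables)
    edges = set()
    for i, j in itertools.combinations(range(len(variables)), 2):
        a, b = variables[i], variables[j]
        if frozenset({a, b}) in implausible:
            continue  # LLM world knowledge forbids this edge
        rest = [k for k in range(len(variables)) if k not in (i, j)]
        if partial_corr_pvalue(data, i, j, rest) < alpha:
            edges.add(frozenset({a, b}))
    return edges


if __name__ == "__main__":
    # Toy data: outcome is caused by treatment; noise is independent.
    rng = np.random.default_rng(0)
    treatment = rng.normal(size=500)
    outcome = 2.0 * treatment + rng.normal(size=500)
    noise = rng.normal(size=500)
    data = np.column_stack([treatment, outcome, noise])
    # Expected: only the treatment-outcome edge survives.
    print(hybrid_skeleton(data, ["treatment", "outcome", "noise"]))
```

In practice the stub would be replaced by actual LLM prompts over the variable descriptions, and the partial-correlation step by a full PC-style search that accepts the LLM judgements as background knowledge.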
Submission Number: 22