Keywords: Causal Discovery, LLM, PC
TL;DR: LLM evaluations on popular causal discovery benchmarks are flawed; we need science-grounded novel graphs and hybrid methods that combine LLMs with data-driven approaches.
Abstract: Recent claims of strong performance by Large Language Models (LLMs) on causal discovery tasks are undermined by a critical flaw: many evaluations rely on widely used benchmarks that likely appear in LLMs' pretraining corpora. As a result, empirical success on these benchmarks seems to suggest that LLM-only methods, which ignore observational data, outperform classical statistical approaches to causal discovery. In this position paper, we challenge this emerging narrative by raising fundamental questions: Are LLMs truly reasoning about causal structure, and if so, how do we measure it reliably without memorization concerns? And can they be trusted for causal discovery in real-world scientific domains? We argue that realizing the true potential of LLMs for causal analysis in scientific research demands two key shifts. First, (P.1) the development of robust evaluation protocols based on recent scientific studies that effectively guard against dataset leakage. Second, (P.2) the design of hybrid methods that combine LLM-derived world knowledge with data-driven statistical methods.
To address P.1, we encourage the research community to evaluate discovery methods on novel, real-world scientific studies, so that the results remain relevant to modern science. We provide a practical recipe for extracting causal graphs from recent scientific publications released after the training cutoff date of a given LLM. These graphs not only prevent verbatim memorization but also typically contain a balanced mix of well-established and novel causal relationships.
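As a rough illustration of the leakage-guard step in such a recipe (a sketch under assumptions, not the paper's exact pipeline: the cutoff date, the paper metadata format, and the extract_edges helper are hypothetical, and in practice edge extraction involves reading each study's reported findings):

```python
# Sketch: keep only publications released after the LLM's training cutoff,
# then assemble the extracted causal graph. The cutoff date, the paper
# metadata format, and the extract_edges() callable are hypothetical.
from datetime import date

import networkx as nx

LLM_CUTOFF = date(2023, 12, 31)  # hypothetical training-cutoff date

def build_benchmark_graph(papers, extract_edges):
    """papers: iterable of dicts with 'published' (date) and 'text' (str).
    extract_edges: callable mapping a paper's text to (cause, effect)
    pairs, e.g. an annotation step guided by the study's findings."""
    graph = nx.DiGraph()
    for paper in papers:
        # Leakage guard: skip studies the LLM may have memorized.
        if paper["published"] <= LLM_CUTOFF:
            continue
        graph.add_edges_from(extract_edges(paper["text"]))
    return graph
```

Filtering on publication date rather than topic keeps the guard simple and auditable, while still admitting graphs that mix well-known and newly reported relationships.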
Whereas LLMs achieve near-perfect accuracy on widely used benchmarks from BNLearn, they perform significantly worse on our curated graphs, underscoring the need for statistical methods to bridge the gap. To support our second position (P.2), we show that a simple hybrid approach that uses LLM predictions as priors for the classical PC algorithm significantly improves accuracy over both LLM-only and traditional data-driven methods. These findings motivate a call to the research community: adopt science-grounded benchmarks that minimize dataset leakage, and invest in hybrid methodologies better suited to the nuanced demands of real-world scientific inquiry.
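To make the hybrid idea in P.2 concrete, here is a minimal sketch of one way LLM-elicited edge beliefs could be passed to the PC algorithm as background knowledge, assuming causal-learn's pc and BackgroundKnowledge APIs; the edge lists, variable names, and synthetic data are placeholder assumptions, not the paper's exact method.

```python
# Sketch: encode LLM-elicited edge beliefs as background knowledge for
# the PC algorithm via causal-learn. The edge lists below are
# placeholders; in practice they would come from prompting an LLM about
# the variables' real-world semantics.
import numpy as np
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.PCUtils.BackgroundKnowledge import BackgroundKnowledge

data = np.random.randn(500, 4)  # stand-in for real observational data

# pc() names variables X1..Xn by default, so the priors use those names.
llm_required = [("X1", "X2")]   # orientations the LLM asserts
llm_forbidden = [("X3", "X1")]  # orientations the LLM rules out

bk = BackgroundKnowledge()
for cause, effect in llm_required:
    bk.add_required_by_pattern(cause, effect)
for cause, effect in llm_forbidden:
    bk.add_forbidden_by_pattern(cause, effect)

# Run PC with the LLM-derived priors constraining edge orientation.
cg = pc(data, alpha=0.05, indep_test="fisherz", background_knowledge=bk)
print(cg.G)  # resulting CPDAG with the priors applied
```

In this arrangement the data-driven conditional-independence tests still decide the skeleton, while the LLM's world knowledge resolves orientation ambiguities, which is one plausible reading of how such a hybrid could outperform either component alone.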
Submission Number: 744