CauSciBench: Assessing LLM Causal Reasoning for Scientific Research

Published: 23 Sept 2025, Last Modified: 18 Oct 2025 | NeurIPS 2025 Workshop CauScien Poster | CC BY 4.0
Keywords: Causal Reasoning, LLM Benchmarks
TL;DR: CauSciBench provides a solid foundation for benchmarking LLM causal reasoning through questions curated from real-world research publications across various domains, synthetic causal scenarios, and textbook samples.
Abstract: While large language models (LLMs) are increasingly integrated into scientific research, their capability to perform causal inference, a cornerstone of scientific induction, remains under-evaluated. Existing benchmarks either focus narrowly on verifying method execution or pose open-ended tasks that lack precision in defining causal estimands, methodological choices, and variable selection. To address this gap, we introduce CauSciBench, a comprehensive benchmark that combines expert-curated problems from published research papers with diverse synthetic scenarios. Our benchmark spans both the potential outcomes framework and Pearl's structural causal model (SCM) framework, enabling systematic evaluation of LLM causal reasoning capabilities.
Submission Number: 24