CauSciBench: Assessing LLM Causal Reasoning for Scientific Research

Published: 23 Sept 2025, Last Modified: 18 Oct 2025 | NeurIPS 2025 Workshop CauScien Poster | CC BY 4.0
Keywords: Causal Reasoning, LLM Benchmarks
TL;DR: CauSciBench provides a solid foundation for benchmarking LLM causal reasoning through questions curated from real-world research publications across various domains, synthetic causal scenarios, and textbook samples.
Abstract: While large language models (LLMs) are increasingly integrated into scientific research, their capability to perform causal inference, a cornerstone of scientific induction, remains under-evaluated. Existing benchmarks either focus narrowly on verifying method execution or pose open-ended tasks that lack precision in defining causal estimands, methodological choices, and variable selection. To address this gap, we introduce CauSciBench, a comprehensive benchmark that combines expert-curated problems from published research papers with diverse synthetic scenarios. Our benchmark spans both the potential outcomes framework and Pearl's structural causal model (SCM) framework, enabling systematic evaluation of LLM causal reasoning capabilities.
Submission Number: 24