Keywords: AI for Science, Causal Inference
TL;DR: We introduce CauSciBench, the first comprehensive benchmark designed to evaluate end-to-end causal inference for scientific research across 9 disciplines.
Abstract: Large language models (LLMs) show increasing promise in accelerating scientific research, yet their ability to facilitate causal inference for scientific discovery remains underexplored.
We introduce CauSciBench, the first comprehensive benchmark to evaluate end-to-end causal inference for scientific research.
CauSciBench comprises 367 evaluation tasks based on 100+ real-world research papers across 9 disciplines, augmented with synthetic scenarios and textbook examples.
It is the first benchmark to probe the complete causal analysis pipeline, from natural-language problem formulation through variable selection and method choice to statistical model implementation and result interpretation, all without intermediate hints.
We evaluate 6 state-of-the-art models with various test-time scaling techniques, including Chain-of-Thought, Program-of-Thought, and ReAct prompting.
The best-performing model, OpenAI-o3 with CoT prompting, still attains a mean relative error (MRE) of 48.96\% on problems derived from real-world research papers, highlighting a substantial gap between current model capabilities and the demands of research-level causal analysis.
We call on the community to develop new methods and rigorous evaluations for building agents that can reliably facilitate causal inference in scientific research.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18257