Keywords: AI for Science, Causal Inference
TL;DR: We introduce CauSciBench, the first comprehensive benchmark designed to evaluate end-to-end causal inference for scientific research across 9 disciplines.
Abstract: Large language models (LLMs) show increasing promise in accelerating scientific research, yet their ability to facilitate causal inference for scientific discovery remains underexplored.
We introduce CauSciBench, the first comprehensive benchmark to evaluate end-to-end causal inference for scientific research.
CauSciBench comprises 367 evaluation tasks based on 100+ real-world research papers across 9 disciplines, augmented with synthetic scenarios and textbook examples.
It is the first benchmark to probe the complete causal analysis pipeline, from natural-language problem formulation through variable selection and method choice to statistical model implementation and result interpretation, all without intermediate hints.
We evaluate 6 state-of-the-art models with various test-time scaling techniques, including Chain-of-Thought, Program-of-Thought, and ReAct prompting.
The best-performing model, OpenAI-o3 with CoT prompting, still attains a mean relative error (MRE) of 48.96\% on problems derived from real-world research papers, highlighting a substantial gap between current model capabilities and the demands of research-level causal analysis.
We call on the community to develop new methods and rigorous evaluations for building agents that can reliably facilitate causal inference in scientific research.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 18257