CauSciBench: Can LLMs Automate Causal Inference in Real-World Scientific Research?

Sawal Acharya; Terry Jingchen Zhang; Andrew Kim; Rahul Babu Shrestha; Xianlin Sun; Pepijn Cobben; Maximilian Mordig; Jacob T. Emmerson; Anahita Haghighat; Furkan Danisman; Yuen Chen; Clijo Jose; Andrei Ioan Muresanu; Justin Cui; Jiarui Liu; Yahang Qi; Punya Syon Pandey; Yinya Huang; Bernhard Schölkopf; Zhijing Jin

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY-NC 4.0

Abstract: Identifying and estimating causal relationships from data is a crucial component of empirical research. While large language model-powered tools have shown potential for assisting research workflows, their ability to perform end-to-end causal inference remains underexplored. We introduce CauSciBench, a benchmark that puts LLM-powered tools to the test on causality-driven research questions. Unlike previous related benchmarks that focus on coding alone, CauSciBench enables evaluation across the full pipeline of causal inference: from method and variable selection to computation of causal effects and statistical interpretation in the context of real-world research problems. We evaluated 7 frontier models on over 300 queries derived from scientific publications, textbook problems, seminal datasets, and synthetic scenarios. Results show that models consistently perform worse on real datasets, with the key bottleneck being the selection of an appropriate causal inference method.

Lay Summary: Causality describes the extent to which one variable influences another, and the goal of causal inference is to quantify this effect. This quantification has real stakes: it has helped identify which drugs treat diseases and whether income support programs actually lead to higher earnings. Inferring causal effects is important, but it is easier said than done, because many factors can influence any given outcome. In recent years, there has been growing interest in applying large language models (LLMs) to estimate causal effects from data. Existing work focuses on assessing the ability of LLMs to implement a chosen causal model. We go a step further and study whether LLMs can build a causal model from scratch by selecting the right method and variables. To enable this, we introduce a new benchmark, CauSciBench. Our experiments show that the main challenge lies in selecting the right method to isolate a causal effect. Models often default to methods that capture correlation rather than causation, and as most of us know, correlation is not causation.
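The gap between correlation and causation described above can be illustrated with a minimal sketch on synthetic data. Nothing here comes from CauSciBench itself: the data-generating process, the true effect of 2.0, and the function names `simulate`, `naive_difference`, and `backdoor_adjusted` are all assumptions made for illustration. A binary confounder `z` raises both the chance of treatment and the outcome, so a plain difference in means overstates the effect, while averaging within-stratum differences over `z` (a simple backdoor adjustment) recovers it.

```python
import random
from statistics import mean


def simulate(n=20000, seed=0):
    """Generate hypothetical (z, t, y) rows with a confounder z.

    Assumed process: z ~ Bernoulli(0.5); treatment t is more likely when z = 1;
    y = 2.0 * t + 3.0 * z + Gaussian noise, so the true effect of t is 2.0.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0
        t = 1 if rng.random() < (0.8 if z else 0.2) else 0
        y = 2.0 * t + 3.0 * z + rng.gauss(0.0, 1.0)
        rows.append((z, t, y))
    return rows


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated units.

    This is a correlational estimate: it ignores the confounder z.
    """
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def backdoor_adjusted(rows):
    """Average the within-stratum differences, weighted by stratum size."""
    n = len(rows)
    effect = 0.0
    for z_val in (0, 1):
        stratum = [(t, y) for z, t, y in rows if z == z_val]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        effect += (len(stratum) / n) * (mean(treated) - mean(control))
    return effect


rows = simulate()
print(f"naive difference in means: {naive_difference(rows):.2f}")
print(f"adjusted for z:            {backdoor_adjusted(rows):.2f}")
```

Under these assumptions the naive estimate should land near 3.8 (the true effect plus confounding bias of 3.0 × 0.6), while the adjusted estimate should land near 2.0. Real research questions rarely come with the confounders labeled, which is why choosing the method and the variables is the hard part.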

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/causalNLP/CauSciBench

Primary Area: Deep Learning->Large Language Models

Keywords: Causal Reasoning

Originally Submitted PDF: pdf

Submission Number: 20719
