Abstract: Identifying and estimating causal relationships from data is a crucial component of empirical research. While large language model-powered tools have shown potential for assisting research workflows, their ability to perform end-to-end causal inference remains underexplored. We introduce CauSciBench, a benchmark that puts LLM-powered tools to the test on causality-driven research questions. Unlike previous related benchmarks that focus on coding alone, CauSciBench enables evaluation across the full pipeline of causal inference: from method and variable selection to computation of causal effects and statistical interpretation in the context of real-world research problems. We evaluated 7 frontier models on over 300 queries derived from scientific publications, textbook problems, seminal datasets, and synthetic scenarios. Results show that models consistently perform worse on real datasets, with the key bottleneck being the selection of an appropriate causal inference method.
Lay Summary: Causality describes the extent to which one variable influences another, and the goal of causal inference is to quantify this effect. This quantification has real stakes: it has helped identify which drugs treat diseases and whether income support programs actually lead to higher earnings. Inferring causal effects is easier said than done, however, because many factors can influence any given outcome. In recent years, there has been growing interest in applying large language models (LLMs) to estimate causal effects from data. Existing work focuses on assessing the ability of LLMs to implement a chosen causal model. We go a step further and study whether LLMs can build a causal model from scratch by selecting the right method and variables. To enable this, we introduce a new dataset. Our experiments show that the main challenge lies in selecting the right method to isolate a causal effect. Models often default to methods that capture correlation rather than causation, and as most of us know, correlation is not causation.
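To make the correlation-versus-causation point concrete, here is a minimal sketch on hypothetical toy data; it is not drawn from CauSciBench or from the paper's experiments. A made-up confounder `z` raises both the chance of treatment `t` and the outcome `y`, and the outcome is constructed as `y = 2*t + 5*z`, so the true effect of treatment is 2 by design. The function names and the data are assumptions chosen purely for illustration.

```python
from statistics import mean


def make_rows():
    """Build hypothetical (z, t, y) rows where y = 2*t + 5*z (true effect of t is 2)."""
    rows = []
    # In stratum z=0 few units are treated; in stratum z=1 most are treated.
    for z, n_treated, n_untreated in [(0, 2, 8), (1, 8, 2)]:
        rows += [(z, 1, 2 * 1 + 5 * z)] * n_treated
        rows += [(z, 0, 5 * z)] * n_untreated
    return rows


def naive_difference(rows):
    """Difference in mean outcome between treated and untreated units (a correlation)."""
    treated = [y for _, t, y in rows if t == 1]
    untreated = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(untreated)


def adjusted_difference(rows):
    """Stratify on z and average the within-stratum differences, weighted by stratum size."""
    total = len(rows)
    effect = 0.0
    for z in sorted({z for z, _, _ in rows}):
        stratum = [r for r in rows if r[0] == z]
        effect += (len(stratum) / total) * naive_difference(stratum)
    return effect


rows = make_rows()
print("naive estimate:", naive_difference(rows))
print("adjusted estimate:", adjusted_difference(rows))
```

On this toy data the naive comparison yields 5, more than double the true effect, while stratifying on the confounder recovers 2. Real analyses are rarely this clean: choosing which variables to adjust for, and which estimation method suits the study design, is the step the summary identifies as the main challenge.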
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/causalNLP/CauSciBench
Primary Area: Deep Learning->Large Language Models
Keywords: Causal Reasoning
Originally Submitted PDF: pdf
Submission Number: 20719