CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
Keywords: Causal inference, Causal reasoning, Identification, Estimation, Experimental designs, Observational Studies
TL;DR: We introduce a real-world benchmark that separately evaluates whether automated causal-inference systems identify valid research designs and correctly estimate causal effects.
Abstract: Many benchmarks for automated causal inference evaluate systems using a single numerical output, such as an average treatment effect. Such evaluations do not distinguish errors in identification (specifying the research design and its assumptions) from errors in estimation (implementing the design on finite data). We introduce CausalReasoningBenchmark, a benchmark of 173 causal queries over 138 real-world datasets, curated from 85 peer-reviewed papers and four widely used causal-inference textbooks. For each query, a system must produce both a structured identification specification, including the strategy, treatment, outcome, controls, and design-specific fields, and a numerical estimate with a standard error. This structure allows identification errors to be evaluated separately from estimation errors. The benchmark covers instrumental variables, regression discontinuity, difference-in-differences, conditional exogeneity, and randomized controlled trials. In a baseline evaluation with an LLM, the model identifies the high-level strategy in 79% of cases, but produces a fully correct identification specification in only 34% of cases. These results show that current systems often recognize broad design families while failing on the detailed variable and estimand specifications required for valid causal analysis.
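To make the two-part output concrete, the following is a minimal Python sketch of how a structured identification specification and a numerical estimate could be scored separately. It is an illustration under stated assumptions, not the benchmark's actual schema or scoring code: the class names, field names, variable names, numbers, and the tolerance rule for the estimate are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdentificationSpec:
    # Hypothetical field names, loosely following the fields listed in the abstract.
    strategy: str                    # e.g. "iv", "rdd", "did"
    treatment: str
    outcome: str
    controls: frozenset = frozenset()
    design_fields: tuple = ()        # (key, value) pairs, e.g. the instrument for IV


@dataclass(frozen=True)
class Submission:
    spec: IdentificationSpec
    estimate: float
    std_error: float


def score(pred: Submission, gold: Submission, tol: float = 2.0) -> dict:
    """Score identification and estimation separately.

    The estimation rule (within `tol` reference standard errors) is an
    assumption made for this sketch, not the benchmark's criterion.
    """
    return {
        "strategy_correct": pred.spec.strategy == gold.spec.strategy,
        "spec_correct": pred.spec == gold.spec,  # every field must match
        "estimate_close": abs(pred.estimate - gold.estimate) <= tol * gold.std_error,
    }


# Illustrative (made-up) reference answer and system output.
gold = Submission(
    IdentificationSpec("iv", "schooling", "wage", frozenset({"age"}),
                       (("instrument", "quarter_of_birth"),)),
    estimate=0.10, std_error=0.02,
)
pred = Submission(
    IdentificationSpec("iv", "schooling", "wage", frozenset({"age"}),
                       (("instrument", "distance_to_college"),)),
    estimate=0.12, std_error=0.03,
)

result = score(pred, gold)
print(result)
```

In this made-up case the system picks the right design family and lands near the reference estimate, yet names the wrong instrument, so the full specification is marked incorrect. A single-number evaluation would not surface that distinction.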
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 21