CLadder: A Benchmark to Assess Causal Reasoning Capabilities of Language Models
Keywords: Large Language Models, Causal Reasoning, Causal Inference, Benchmark Dataset, Natural Language Processing
TL;DR: We create a large-scale dataset, CLadder, as a benchmark to probe causal reasoning abilities in LLMs.
Abstract: The ability to perform causal reasoning is widely considered a core feature of intelligence. In this work, we investigate whether large language models (LLMs) can coherently reason about causality. Much of the existing work in natural language processing (NLP) focuses on evaluating _commonsense_ causal reasoning in LLMs, and thus fails to assess whether a model can perform causal inference in accordance with a set of well-defined _formal rules_. To address this, we propose a new NLP task, _causal inference in natural language_, inspired by the _“causal inference engine”_ postulated by Judea Pearl et al. We compose a large dataset, CLadder, with 10K samples: based on a collection of causal graphs and queries (associational, interventional, and counterfactual), we obtain symbolic questions and ground-truth answers through an oracle causal inference engine. These are then translated into natural language. We evaluate multiple LLMs on our dataset, and we introduce and evaluate a bespoke chain-of-thought prompting strategy, CausalCoT. We show that our task is highly challenging for LLMs, and we conduct an in-depth analysis to gain deeper insight into the causal reasoning abilities of LLMs.
Submission Number: 1414