CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models

Published: 10 Oct 2024, Last Modified: 01 Nov 2024CaLM @NeurIPS 2024 PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: Causal Benchmark, LLMs, Causal Graph
TL;DR: Created a comprehensive benchmark, which can evaluate the causal reasoning capability of LLMs
Abstract: Causal reasoning, a core aspect of human cognition, is essential for advancing large language models (LLMs) towards artificial general intelligence (AGI) and reducing their propensity for generating hallucinations. However, existing datasets for evaluating causal reasoning in LLMs are limited by narrow domain coverage and a focus on cause-to-effect reasoning through textual problems, which does not comprehensively assess whether LLMs truly grasp causal relationships or merely guess correct answers. To address these shortcomings, we introduce a novel benchmark that spans textual, mathematical, and coding problem domains. Each problem is crafted to probe causal understanding from four perspectives: cause-to-effect, effect-to-cause, cause-to-effect with intervention, and effect-to-cause with intervention. This multi-dimensional evaluation method ensures that LLMs must exhibit a genuine understanding of causal structures by correctly answering questions across all four dimensions, mitigating the possibility of cor14 rect responses by chance. Furthermore, our benchmark explores the relationship between an LLM’s causal reasoning performance and its tendency to produce hallucinations. We present evaluations of state-of-the-art LLMs using our bench17 mark, providing valuable insights into their current causal reasoning capabilities across diverse domains. The dataset is publicly available for download at https://huggingface.co/datasets/CCLV/CausalBench.
Submission Number: 15
Loading