Keywords: Causal Relationship Extraction, Deceptive Correlation, Multi-Agent Framework, Categorized Benchmark Dataset
Abstract: Extracting accurate causal relationships from text is crucial for developing Causal Knowledge Graphs (CKGs), which support advanced reasoning and decision-making. Traditional approaches often struggle with linguistic ambiguity and the complexity of natural language. Existing benchmarks, like SemEval-2007 Task 4, primarily feature short sentences, limiting the evaluation of modern Large Language Models (LLMs) in longer contexts.
In this study, we present two key contributions: (1) a novel Multi-Agent Causal Extraction System that employs a multi-stage verification process, with a Judge agent for relationship extraction and a Critic agent for reasoning verification; and (2) a Categorized Benchmark Dataset containing 10,000 long-context examples across 20 causal and non-causal categories, including “deceptive correlations,” to test models' capabilities.
Our experiments reveal that while our system achieves human-level performance (89.66%) on SemEval-2007, accuracy drops to 70.00% on our benchmark, highlighting the need for more rigorous evaluations in causal reasoning.
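The Judge-then-Critic verification loop described in the abstract could be sketched roughly as follows. This is a minimal illustration only: all class and function names are hypothetical, and simple string heuristics stand in for the LLM-backed agents the paper actually uses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Extraction:
    cause: str
    effect: str
    reasoning: str

def judge_agent(sentence: str) -> Extraction:
    """Judge agent (toy stand-in): proposes a cause-effect pair with reasoning.

    A rule-based split on an explicit cue phrase replaces the LLM call.
    """
    cause, _, effect = sentence.partition(" causes ")
    return Extraction(cause.strip(), effect.strip(),
                      reasoning="explicit causal cue 'causes' found")

def critic_agent(sentence: str, extraction: Extraction) -> bool:
    """Critic agent (toy stand-in): checks the Judge's output against the text."""
    return (extraction.effect != ""
            and extraction.cause in sentence
            and extraction.effect in sentence)

def extract_causal_relation(sentence: str) -> Optional[Extraction]:
    """Multi-stage pipeline: accept an extraction only if the Critic verifies it."""
    candidate = judge_agent(sentence)
    return candidate if critic_agent(sentence, candidate) else None

# A sentence with an explicit causal cue passes both stages;
# a merely correlational sentence is rejected by the Critic.
print(extract_causal_relation("Smoking causes lung cancer"))
print(extract_causal_relation("Ice cream sales rise alongside drowning deaths"))
```

In the real system the Critic would verify the Judge's chain of reasoning rather than string containment, and a "deceptive correlation" example is exactly the case the second call is meant to reject.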
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Causal Relationship Extraction, Large Language Models (LLMs), Benchmark Dataset, Deceptive Correlations
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 2767