Abstract: Causal reasoning in natural language requires identifying relevant variables, understanding their interactions, and reasoning about effects and interventions, often under noisy or ambiguous conditions. While large language models (LLMs) exhibit strong general reasoning abilities, they struggle to disentangle correlation from causation, particularly when observations are partially incorrect or irrelevant information is present. In this work, we introduce \textit{NoisyCausal}, a new benchmark designed to evaluate causal reasoning under structured noise. Each instance is generated from a ground-truth causal graph and contextualized as a natural language scenario, with controllable forms of noise injected, such as irrelevant distractors, value perturbations, confounding, and partial observability. We further propose a modular reasoning framework that combines LLMs with explicit causal structure to address these challenges. Our method prompts the LLM to extract variables and construct a causal graph from context, then reformulates the reasoning task as a structured prompt grounded in this graph. Rather than relying on statistical patterns alone, the LLM is guided by symbolic structure, enabling more interpretable and robust inference. Experimental results show that our method significantly outperforms standard prompting and reasoning baselines on \textit{NoisyCausal}. Furthermore, it generalizes well to external benchmarks such as Cladder without task-specific tuning. Our findings highlight the importance of combining causal abstractions with language-driven reasoning to achieve faithful and robust causal understanding in LLMs.
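To make the three-stage pipeline described in the abstract concrete, below is a minimal, hypothetical Python sketch of variable extraction, graph construction, and graph-grounded prompting. The call_llm helper, the prompt wordings, and the "A -> B" edge format are illustrative assumptions, not the paper's actual implementation.

from typing import Dict, List

def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion client; returns the model's text reply."""
    raise NotImplementedError("wire up an LLM client here")

def extract_variables(scenario: str) -> List[str]:
    # Stage 1: ask the model to list candidate causal variables, one per line.
    reply = call_llm("List the causal variables in this scenario, one per line:\n" + scenario)
    return [line.strip() for line in reply.splitlines() if line.strip()]

def build_causal_graph(scenario: str, variables: List[str]) -> Dict[str, List[str]]:
    # Stage 2: ask for directed edges "A -> B" restricted to the extracted
    # variables, which filters out distractor variables injected as noise.
    reply = call_llm(
        "Given the variables {}, list the causal edges as 'A -> B', one per line, "
        "for this scenario:\n{}".format(", ".join(variables), scenario)
    )
    graph: Dict[str, List[str]] = {v: [] for v in variables}
    for line in reply.splitlines():
        if "->" in line:
            src, dst = (part.strip() for part in line.split("->", 1))
            if src in graph and dst in graph:
                graph[src].append(dst)
    return graph

def answer_query(scenario: str, graph: Dict[str, List[str]], question: str) -> str:
    # Stage 3: reformulate the question as a structured prompt grounded in the
    # graph, so the model reasons over explicit edges rather than correlations.
    edges = "; ".join(f"{s} -> {d}" for s, ds in graph.items() for d in ds)
    return call_llm(
        f"Causal graph: {edges}\nScenario: {scenario}\n"
        f"Using only the graph above, answer: {question}"
    )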
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Resources and Evaluation
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Previous URL: https://openreview.net/forum?id=aply0nmQSD
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
A2 Elaboration: The paper poses no potential risks.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 2 describes related work and the datasets we used.
B2 Discuss The License For Artifacts: No
B2 Elaboration: The datasets are publicly available and have been used in previous research papers.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: The datasets were created for research use.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 5 provides details of the dataset.
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: Our key contribution is creating a new dataset rather than designing a new architecture.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: In Section 5 we provide details of our experiments.
C3 Descriptive Statistics: No
C3 Elaboration: Due to the cost of using LLMs, we evaluated each method with a single run.
C4 Parameters For Packages: No
C4 Elaboration: We did not use such packages.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Section 7
Author Submission Checklist: Yes
Submission Number: 904