Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

ACL ARR 2025 July Submission 1337 Authors

29 Jul 2025 (modified: 25 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large language models (LLMs) are increasingly applied to tasks involving causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that fail to assess whether a model’s output is formally valid under causal semantics. We propose DoVerifier, a symbolic verification framework that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers that would otherwise be marked incorrect due to superficial differences. Evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness than standard metrics, offering a more rigorous and informative way to evaluate LLMs on causal tasks.
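The following is a minimal, hypothetical sketch (in Python) of the verification idea the abstract describes: an LLM-produced causal expression is treated as a symbolic term, and the verifier asks whether it can be rewritten into a known-valid target expression within a bounded number of rule applications (the depth bound of 20 echoes the answer under checklist item C2). The expression encoding, the two toy rewrite rules, and all function names are illustrative assumptions, not the paper's implementation, which would additionally check the graphical side conditions of each do-calculus rule against the causal graph.

```python
# Minimal sketch (not the authors' implementation) of bounded symbolic
# verification: is a candidate causal expression derivable from a target
# within a fixed number of rewrite-rule applications?
from collections import deque

# Toy encoding: P(outcome | conditioners) is a tuple
# ("P", outcome, frozenset_of_conditioners); an intervention do(x) is ("do", "x").
Expr = tuple

def rule_drop_do(expr: Expr) -> list[Expr]:
    """Hypothetical rewrite: drop a do() term. Stands in for do-calculus Rule 3,
    which in a real verifier requires a graphical separation check."""
    kind, outcome, cond = expr
    return [
        (kind, outcome, cond - {c})
        for c in cond
        if isinstance(c, tuple) and c[0] == "do"
    ]

def rule_do_to_see(expr: Expr) -> list[Expr]:
    """Hypothetical rewrite: replace do(x) with plain conditioning on x. Stands in
    for do-calculus Rule 2 under an assumed separation condition."""
    kind, outcome, cond = expr
    return [
        (kind, outcome, (cond - {c}) | {c[1]})
        for c in cond
        if isinstance(c, tuple) and c[0] == "do"
    ]

RULES = [rule_drop_do, rule_do_to_see]

def derivable(start: Expr, target: Expr, max_depth: int = 20) -> bool:
    """Breadth-first proof search: is `target` reachable from `start` within
    `max_depth` rule applications? The depth bound is an example value."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        expr, depth = frontier.popleft()
        if expr == target:
            return True
        if depth == max_depth:
            continue
        for rule in RULES:
            for nxt in rule(expr):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False

# Example: P(y | do(x)) rewrites to P(y | x) via the do-to-see rule, so the two
# answers would be judged semantically equivalent despite differing strings.
llm_answer = ("P", "y", frozenset({("do", "x")}))
target = ("P", "y", frozenset({"x"}))
print(derivable(llm_answer, target))  # True
```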
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Theory
Languages Studied: English
Previous URL: https://openreview.net/forum?id=hCCCOtPQYJ
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.2
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 2 (Related Work) and Section 4.2 (LLM Causal Reasoning Test) cite and describe the use of existing benchmarks and tools, including CLadder (Jin et al., 2023) and other causal evaluation resources.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 4.2
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 4.1, Appendix E
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We did not report model sizes or compute budgets in detail because our main contribution is a symbolic evaluation framework. While we used publicly available LLMs (e.g., LLaMA-3, Mistral) for generating outputs in our experiments, we relied on existing inference APIs or local runs and did not fine-tune or train any models. The verifier itself runs efficiently on CPU and does not require significant compute.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: Our work does not involve training models or hyperparameter tuning. We used publicly available pre-trained language models for inference, and our symbolic verification framework operates via deterministic rule-based proof search with a fixed depth bound (e.g., 20), not through learned parameters or optimization.
C3 Descriptive Statistics: No
C3 Elaboration: Our evaluation is based on deterministic symbolic verification (e.g., whether an LLM output is formally derivable or not). The results are binary (correct/incorrect) and do not vary across runs, so summary statistics like error bars or run variability are not applicable. We report single-pass evaluations across benchmark datasets such as CLadder, and all model outputs were parsed and assessed once.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3.3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: Although we used GitHub Copilot for code assistance and Grammarly for writing clarity, we did not explicitly mention this in the paper. All substantive research contributions, including theoretical development, code logic, and experimental design, were created and validated by the authors.
Author Submission Checklist: yes
Submission Number: 1337