Uncovering Hidden Correctness in LLM Causal Reasoning via Symbolic Verification

ACL ARR 2025 July Submission 1337 Authors

29 Jul 2025 (modified: 25 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Large language models (LLMs) are increasingly applied to tasks involving causal reasoning. However, current benchmarks often rely on string matching or surface-level metrics that fail to assess whether a model’s output is formally valid under causal semantics. We propose DoVerifier, a symbolic verification framework that checks whether LLM-generated causal expressions are derivable from a given causal graph using rules from do-calculus and probability theory. This allows us to recover correct answers that would otherwise be marked incorrect due to superficial differences. Evaluations on synthetic data and causal QA benchmarks show that DoVerifier more accurately captures semantic correctness than standard metrics, offering a more rigorous and informative way to evaluate LLMs on causal tasks.
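The following is a minimal, hypothetical sketch (in Python) of the verification idea the abstract describes: an LLM-produced causal expression is treated as a symbolic term, and the verifier asks whether it can be rewritten into a known-valid target expression within a bounded number of rule applications (the depth bound of 20 echoes the answer under checklist item C2). The expression encoding, the two toy rewrite rules, and all function names are illustrative assumptions, not the paper's implementation, which would additionally check the graphical side conditions of each do-calculus rule against the causal graph.

```python
# Minimal sketch (not the authors' implementation) of bounded symbolic
# verification: is a candidate causal expression derivable from a target
# within a fixed number of rewrite-rule applications?
from collections import deque

# Toy encoding: P(outcome | conditioners) is a tuple
# ("P", outcome, frozenset_of_conditioners); an intervention do(x) is ("do", "x").
Expr = tuple

def rule_drop_do(expr: Expr) -> list[Expr]:
    """Hypothetical rewrite: drop a do() term. Stands in for do-calculus Rule 3,
    which in a real verifier requires a graphical separation check."""
    kind, outcome, cond = expr
    return [
        (kind, outcome, cond - {c})
        for c in cond
        if isinstance(c, tuple) and c[0] == "do"
    ]

def rule_do_to_see(expr: Expr) -> list[Expr]:
    """Hypothetical rewrite: replace do(x) with plain conditioning on x. Stands in
    for do-calculus Rule 2 under an assumed separation condition."""
    kind, outcome, cond = expr
    return [
        (kind, outcome, (cond - {c}) | {c[1]})
        for c in cond
        if isinstance(c, tuple) and c[0] == "do"
    ]

RULES = [rule_drop_do, rule_do_to_see]

def derivable(start: Expr, target: Expr, max_depth: int = 20) -> bool:
    """Breadth-first proof search: is `target` reachable from `start` within
    `max_depth` rule applications? The depth bound is an example value."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        expr, depth = frontier.popleft()
        if expr == target:
            return True
        if depth == max_depth:
            continue
        for rule in RULES:
            for nxt in rule(expr):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, depth + 1))
    return False

# Example: P(y | do(x)) rewrites to P(y | x) via the do-to-see rule, so the two
# answers would be judged semantically equivalent despite differing strings.
llm_answer = ("P", "y", frozenset({("do", "x")}))
target = ("P", "y", frozenset({"x"}))
print(derivable(llm_answer, target))  # True
```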
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation methodologies, metrics
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models, Data resources, Theory
Languages Studied: English
Previous URL: https://openreview.net/forum?id=hCCCOtPQYJ
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: No, I want the same area chair from our previous submission (subject to their availability).
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability).
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.2
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 2 (Related Work) and Section 4.2 (LLM Causal Reasoning Test) cite and describe the use of existing benchmarks and tools, including CLadder (Jin et al., 2023) and other causal evaluation resources.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 4.2
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Section 4.1, Appendix E
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We did not report model sizes or compute budgets in detail because our main contribution is a symbolic evaluation framework. While we used publicly available LLMs (e.g., LLaMA-3, Mistral) for generating outputs in our experiments, we relied on existing inference APIs or local runs and did not fine-tune or train any models. The verifier itself runs efficiently on CPU and does not require significant compute.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: Our work does not involve training models or hyperparameter tuning. We used publicly available pre-trained language models for inference, and our symbolic verification framework operates via deterministic rule-based proof search with a fixed depth bound (e.g., 20), not through learned parameters or optimization.
C3 Descriptive Statistics: No
C3 Elaboration: Our evaluation is based on deterministic symbolic verification (e.g., whether an LLM output is formally derivable or not). The results are binary (correct/incorrect) and do not vary across runs, so summary statistics like error bars or run variability are not applicable. We report single-pass evaluations across benchmark datasets such as CLadder, and all model outputs were parsed and assessed once.
C4 Parameters For Packages: Yes
C4 Elaboration: Section 3.3
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: Although we used GitHub Copilot for code assistance and Grammarly for writing clarity, we did not explicitly mention this in the paper. All substantive research contributions, including theoretical development, code logic, and experimental design, were created and validated by the authors.
Author Submission Checklist: yes
Submission Number: 1337