Improving LLM Reasoning through Collaborative Verification between Natural Languages and Programs

ACL ARR 2025 February Submission 906 Authors

11 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Despite significant advancements in the general capabilities of large language models (LLMs), they continue to struggle with consistent and accurate reasoning. One key limitation is that LLMs are trained primarily on correct solutions, which reduces their ability to detect and learn from errors and, in turn, to reliably verify and rank outputs. To address this, we focus on inference-time verification for reasoning, where verifiers assess generated outputs and rank them by correctness. To better understand different verifier training methods, we introduce a comprehensive dataset of correct and incorrect solutions to math and programming tasks, generated by multiple LLMs. This diverse set of solutions enables verifiers to more effectively distinguish correct answers from erroneous outputs and rank them accordingly. Moreover, to leverage the complementary strengths of different reasoning strategies, we propose a novel collaborative verification method that integrates Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions. Our verifier, Math-Rev, achieves substantial performance gains over existing LLMs and state-of-the-art results on GSM8k and MATH.
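To make the inference-time verification setting concrete, the sketch below shows a minimal best-of-n reranking loop in Python. It is an illustrative assumption, not the paper's implementation: the function verifier_score is a hypothetical placeholder standing in for a trained verifier such as Math-Rev, and the CoT/PoT combination shown (a simple average of the two scores) is only one possible way to realize collaborative verification.

def verifier_score(solution: str) -> float:
    """Hypothetical placeholder: a trained verifier would return an estimated
    probability that the given solution (CoT text or PoT program) is correct.
    Here we use a dummy heuristic purely so the sketch runs end to end."""
    return min(1.0, len(solution) / 1000.0)


def rank_candidates(candidates: list[dict]) -> list[dict]:
    """Rank sampled candidate solutions by a combined CoT + PoT verifier score.

    Each candidate is a dict with keys "cot" (natural-language solution),
    "pot" (program-form solution), and "answer" (final answer string).
    The combination rule below (mean of the two scores) is an assumption
    for illustration, not the paper's exact collaborative method.
    """
    def combined(c: dict) -> float:
        return 0.5 * (verifier_score(c["cot"]) + verifier_score(c["pot"]))

    return sorted(candidates, key=combined, reverse=True)


if __name__ == "__main__":
    samples = [
        {"cot": "Reasoning step by step ...", "pot": "def solve(): return 42", "answer": "42"},
        {"cot": "A shorter attempt ...", "pot": "def solve(): return 41", "answer": "41"},
    ]
    best = rank_candidates(samples)[0]
    print("Selected answer:", best["answer"])

In this setup, the generator samples multiple CoT and PoT solutions per problem, and the verifier's scores are used only at inference time to select the answer, matching the verification-and-ranking framing described in the abstract.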
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Reasoning Verification, LLM
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 906