Not All Votes Count! Translated Program for Verification Improves Self-Consistency of Language Models for Math Reasoning
Keywords: mathematical reasoning, python programs, verification
TL;DR: We propose Prove, a simple yet effective framework that uses program-based verification as a heuristic to filter out potentially incorrect reasoning paths before aggregating the final answers.
Abstract: Large language models (LLMs) have become increasingly capable of solving mathematical reasoning problems. However, many open-source LLMs still encounter issues with calculation errors and semantic misunderstandings during intermediate reasoning steps. In this work, we present Prove, a simple yet effective framework that leverages Python programs translated from natural language solutions as a verification mechanism. This mechanism helps identify and filter out potentially incorrect reasoning paths before final answers are aggregated. Unlike basic majority voting, our approach rejects solutions whose program outputs do not align with their natural language reasoning, and aggregates only those that pass the verification step. We conducted extensive experiments with 13 open-source LLMs of various sizes, ranging from 0.5B to 13B parameters, across eight mathematical benchmarks. Our findings demonstrate that Prove consistently outperforms basic majority voting and other program-assisted reasoning baselines on mathematical reasoning tasks, achieving improvements of up to 18% on GSM8K and 8% on MATH-500. Our code is available at https://github.com/declare-lab/prove.
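To make the verify-then-vote heuristic concrete, here is a minimal sketch of the idea described in the abstract. The function names (`run_program`, `prove_vote`), the assumption that each translated program defines a `solve()` function, and the unsandboxed `exec` call are all illustrative simplifications, not the authors' actual implementation.

```python
from collections import Counter

def run_program(program_src: str) -> str | None:
    """Execute a translated Python program and return its answer as a string.
    Assumes (hypothetically) that the program defines solve(); a real system
    would sandbox execution and capture stdout instead of using exec."""
    namespace: dict = {}
    try:
        exec(program_src, namespace)  # illustrative only: run trusted toy programs
        return str(namespace["solve"]())
    except Exception:
        return None  # execution failure: the path cannot be verified

def prove_vote(samples: list[tuple[str, str]]) -> str | None:
    """samples: (answer_from_natural_language_solution, translated_program) pairs.
    Keep only paths whose program output matches the stated answer, then
    majority-vote over the survivors; fall back to voting over all answers
    if every path is rejected."""
    verified = [ans for ans, prog in samples if run_program(prog) == ans]
    pool = verified if verified else [ans for ans, _ in samples]
    return Counter(pool).most_common(1)[0][0] if pool else None

# Toy usage: two paths agree with their programs, one is rejected.
if __name__ == "__main__":
    samples = [
        ("12", "def solve(): return 3 * 4"),  # verified
        ("12", "def solve(): return 6 + 6"),  # verified
        ("14", "def solve(): return 3 * 4"),  # answer/program mismatch -> filtered out
    ]
    print(prove_vote(samples))  # -> "12"
```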
Submission Number: 154