R2P: Reformulate–Retrieve–Program for Robust Mathematical Reasoning in LLMs

Published: 23 Sept 2025, Last Modified: 22 Nov 2025, LAW, CC BY-NC 4.0
Keywords: Large Language Models, Mathematical Reasoning, In-Context Learning, Program-of-Thoughts
Abstract: Large language models (LLMs) remain brittle on mathematical word problems: small surface-form changes can shift answer distributions and degrade solve rates, while multi-step computation is error-prone. We present R2P, a three-stage inference framework that (1) Reformulates each problem into diverse paraphrases to reduce surface-form bias, (2) Retrieves domain-aligned few-shot exemplars from a curated bank via lightweight embeddings, and (3) Programs the solution as explicit Python code (Program-of-Thoughts), offloading symbolic computation to an interpreter. Under a fixed path budget, R2P samples multiple reasoning trajectories across reformulations and aggregates their outcomes by voting, improving both accuracy and consistency. Evaluations on GSM8K, AQuA, and SVAMP with an off-the-shelf 9B-parameter LLM (zero/few-shot, no fine-tuning) show consistent gains over Chain-of-Thought, self-consistency, and vanilla Program-of-Thoughts. Ablations varying the number of reformulations and comparing naïve vs. in-context reformulation demonstrate that (i) exposing the model to multiple surface forms reliably improves solve rates, and (ii) domain-aware retrieval further boosts robustness. We analyze typical failure modes (misinterpretation of quantities and arithmetic slips) and show how reformulation plus code execution mitigates both. R2P offers a simple, model-agnostic recipe for more reliable mathematical reasoning without additional training.
Submission Type: Research Paper (4-9 Pages)
Submission Number: 122
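
To make the three-stage pipeline concrete, here is a minimal sketch of R2P-style inference under the assumptions that `generate` wraps an arbitrary LLM sampling call, `embed` wraps a lightweight sentence encoder, the exemplar bank stores (text, embedding) pairs, and generated programs store their result in a variable named `ans`. These names and conventions are illustrative placeholders, not the paper's actual implementation.

```python
from collections import Counter
from typing import Callable, Sequence

import numpy as np


def r2p_solve(
    problem: str,
    generate: Callable[[str], str],                    # hypothetical LLM call
    embed: Callable[[str], np.ndarray],                # hypothetical embedder
    exemplar_bank: Sequence[tuple[str, np.ndarray]],   # (exemplar text, embedding)
    n_reformulations: int = 4,
    k_exemplars: int = 3,
) -> str:
    """Sketch of R2P: Reformulate -> Retrieve -> Program, then vote."""
    # (1) Reformulate: sample diverse paraphrases to vary the surface form.
    paraphrases = [problem] + [
        generate(f"Paraphrase this word problem, keeping all quantities:\n{problem}")
        for _ in range(n_reformulations - 1)
    ]

    answers = []
    for text in paraphrases:
        # (2) Retrieve: k nearest exemplars by cosine similarity of embeddings.
        q = embed(text)
        scored = sorted(
            exemplar_bank,
            key=lambda e: float(
                np.dot(q, e[1]) / (np.linalg.norm(q) * np.linalg.norm(e[1]))
            ),
            reverse=True,
        )
        shots = "\n\n".join(e[0] for e in scored[:k_exemplars])

        # (3) Program: prompt for Python whose execution yields the answer.
        code = generate(
            f"{shots}\n\nWrite Python that solves the problem and stores "
            f"the result in `ans`.\nProblem: {text}\nCode:"
        )
        scope: dict = {}
        try:
            exec(code, scope)                # run the Program-of-Thoughts path
            answers.append(str(scope["ans"]))
        except Exception:
            continue                         # drop trajectories whose code fails

    # Aggregate: vote over executed answers across all reformulations.
    return Counter(answers).most_common(1)[0][0] if answers else ""
```

Under this reading, the fixed path budget is n_reformulations times the number of sampled programs per reformulation (one each in the sketch), and voting over executed outputs plays the same role as self-consistency over sampled chains.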