BoostStep: Boosting Mathematical Capability of Large Language Models via Step-aligned In-Context Learning
Keywords: Mathematical Reasoning, Large Language Models, In-context Learning
TL;DR: BoostStep introduces a step-aligned in-context learning mechanism that effectively enhances the mathematical reasoning performance of SOTA models such as GPT-4o and DeepSeek-R1.
Abstract: Large language models (LLMs) have demonstrated an impressive ability to solve complex mathematical problems through multi-step reasoning, and this ability can be further enhanced with well-designed in-context learning (ICL) examples. However, this potential is often constrained by two major challenges in ICL: granularity mismatch and irrelevant information.
We observe that while LLMs excel at decomposing mathematical problems, they are prone to reasoning errors within individual fine-grained steps. Moreover, ICL examples retrieved at the question level may omit critical steps or even mislead the model with irrelevant details.
To address these issues, we propose BoostStep, a method that enhances reasoning accuracy through step-aligned ICL, a novel mechanism that carefully aligns retrieved reference steps with the corresponding reasoning steps. Additionally, BoostStep incorporates an effective "first-try" strategy to retrieve exemplars highly relevant to the current state of reasoning.
BoostStep is a flexible and powerful method that integrates seamlessly with chain-of-thought (CoT) and tree search algorithms, refining both candidate selection and decision-making. Empirical results show that BoostStep improves GPT-4o’s CoT performance by 4.6\% across mathematical benchmarks, significantly surpassing the 1.2\% gain of traditional few-shot learning. Moreover, it achieves an additional 7.5\% gain when combined with tree search. Surprisingly, it enables state-of-the-art LLMs to solve challenging math problems using simpler examples: it improves the performance of DeepSeek-R1-671B and Qwen3-235B on AIME by 2.2\% and 5.0\% respectively, leveraging only simple examples from the MATH dataset.
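The abstract leaves the retrieval loop implicit. Below is a minimal Python sketch of what step-aligned ICL with a "first-try" strategy could look like: draft the next step without guidance, use the draft (rather than the whole question) to retrieve a similar example step, then regenerate the step with that exemplar as guidance. All identifiers here (`llm_generate`, `embed`, `STEP_BANK`, `boost_step`) are hypothetical stand-ins for illustration, not the authors' released code.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy bag-of-characters embedding for illustration only;
    # swap in a real sentence encoder for actual retrieval.
    v = np.zeros(256)
    for ch in text.lower():
        v[ord(ch) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

# Hypothetical step-level example bank: fine-grained solved steps,
# each stored with a precomputed embedding of its content.
STEP_BANK = [
    {"step": "Factor x^2 - 5x + 6 as (x - 2)(x - 3) to find the roots."},
    {"step": "Apply AM-GM: for a, b > 0, a + b >= 2*sqrt(a*b)."},
]
for entry in STEP_BANK:
    entry["emb"] = embed(entry["step"])

def llm_generate(prompt: str) -> str:
    # Placeholder: wire this to whatever LLM API you use.
    raise NotImplementedError

def retrieve_step(query: str, k: int = 1) -> list[str]:
    # Cosine similarity between the query and every example step
    # (embeddings are already unit-norm, so a dot product suffices).
    q = embed(query)
    sims = [float(q @ e["emb"]) for e in STEP_BANK]
    top = np.argsort(sims)[-k:][::-1]
    return [STEP_BANK[i]["step"] for i in top]

def boost_step(question: str, steps_so_far: list[str]) -> str:
    context = "\n".join(steps_so_far)
    # "First try": draft the next step without guidance. The draft reflects
    # the current reasoning state better than the original question does.
    draft = llm_generate(
        f"Problem: {question}\nSteps so far:\n{context}\nNext step:")
    # Retrieve an example step aligned with the draft, not the whole question.
    exemplar = retrieve_step(draft, k=1)[0]
    # Regenerate the step with the step-level exemplar as in-context guidance.
    return llm_generate(
        f"Reference step from a similar sub-problem: {exemplar}\n"
        f"Problem: {question}\nSteps so far:\n{context}\nNext step:")
```

Under this reading, step-level alignment addresses both challenges the abstract names: retrieving against the draft step avoids the granularity mismatch of question-level retrieval, and the retrieved exemplar carries only the sub-problem relevant to the current step rather than a full, possibly distracting solution.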
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3297