Strategy Executability in Mathematical Reasoning: Leveraging Human–Model Differences for Effective Guidance

Weida Liang; Yiyou Sun; Shuyuan Nan; Chuang Li; Dawn Song; Kenji Kawaguchi

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, License: CC BY 4.0

Abstract: Example-based guidance is widely used to improve mathematical reasoning at inference time, yet its effectiveness is highly unstable across problems and models—even when the guidance is correct and problem-relevant. We show that this instability arises from a previously underexplored gap between *strategy usage*—whether a reasoning strategy appears in successful solutions—and *strategy executability*—whether the strategy remains effective when instantiated as guidance for a target model. Through a controlled analysis of paired human-written and model-generated solutions, we identify a systematic dissociation between usage and executability: human- and model-derived strategies differ in structured, domain-dependent ways, leading to complementary strengths and consistent source-dependent reversals under guidance. Building on this diagnosis, we propose *Selective Strategy Retrieval* (SSR), a test-time framework that explicitly models executability by selectively retrieving and combining strategies using empirical, multi-route, source-aware signals. Across multiple mathematical reasoning benchmarks, SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance, improving accuracy by up to $+13$ points on AIME25 and $+5$ points on Apex for compact reasoning models. Code and benchmark are publicly available at: https://github.com/lwd17/strategy-execute-pipeline.

Lay Summary: Large language models can solve many math problems better when they are given examples, hints, or solution strategies. However, this kind of guidance is often unreliable: even a correct and relevant strategy may not help a model, because the model may not be able to actually carry it out. In this paper, we study this problem by comparing how humans and language models solve the same challenging math problems. We find that humans and models often use different kinds of strategies. Humans tend to rely more on high-level insights and problem structure, while models often use more step-by-step algebraic or procedural approaches. Importantly, a strategy that appears in a successful solution is not always a strategy that another model can successfully use as guidance. Based on this observation, we introduce Selective Strategy Retrieval, a method that chooses strategies according to how executable they are for the target model, rather than only how similar or common they are. Across several math reasoning benchmarks, this approach improves model accuracy and makes guidance more reliable.

Primary Area: Deep Learning->Large Language Models

Keywords: large language models, mathematical reasoning, strategy executability

Submission Number: 4069
