CORE: Concept-Oriented Reinforcement for Bridging the Definition–Application Gap in Mathematical Reasoning
Keywords: large language models, mathematical reasoning, conceptual understanding, fine-tuning, knowledge distillation, robustness
Abstract: Large language models (LLMs) often solve drill-style math exercises yet fail to apply the same concepts when a problem requires genuine understanding. Popular outcome-based RL pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than concept selection and use. We introduce $\textit{CORE}$ (Concept-Oriented REinforcement), an algorithm-agnostic training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing that LLMs can restate definitions while failing concept-linked quizzes, quantifying the conceptual reasoning gap. $\textit{CORE}$ then (i) synthesizes additional concept-aligned quizzes, (ii) injects concept snippets into rollouts, and (iii) reinforces trajectories that correctly apply the injected concept while constraining drift with a lightweight divergence penalty; the procedure is compatible with standard policy-gradient methods (e.g., GRPO). On a 7B-class model, $\textit{CORE}$ yields consistent gains over a vanilla baseline and reinforcement-only training, both on in-domain concept–exercise suites and on diverse out-of-domain math benchmarks (GSM8K, SVAMP, MAWPS, SAT-Math, OlympiadBench, Gaokao, Minerva-Math, CounterMath, TheoremQA). Improvements are largest on concept-heavy categories while drill performance is maintained or modestly improved. $\textit{CORE}$ demonstrates that concept-injected, outcome-regularized rollouts supply the missing fine-grained supervision needed to bridge drill competence and true conceptual reasoning, without committing to a particular RL algorithm or to bespoke process-based verifiers.
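The abstract's step (iii) can be illustrated with a toy sketch: an outcome reward for correctly applying the injected concept, minus a lightweight divergence penalty on the rollout, followed by GRPO-style group-normalized advantages. All function names, the penalty form, and the coefficient `beta` are illustrative assumptions, not the paper's specification.

```python
import math

def core_rewards(correct, logp_policy, logp_ref, beta=0.05):
    """Hypothetical CORE-style reward: 1 if the rollout correctly applies
    the injected concept, minus a per-trajectory divergence penalty
    (log-prob gap to a frozen reference model) that constrains drift."""
    return [float(c) - beta * (lp - lr)
            for c, lp, lr in zip(correct, logp_policy, logp_ref)]

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward by the mean and standard deviation of its rollout group."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mu) / std for r in rewards]

# Example: four rollouts sampled for one concept-injected prompt
rewards = core_rewards(
    correct=[1, 0, 1, 0],
    logp_policy=[-12.0, -10.5, -11.8, -9.9],
    logp_ref=[-12.2, -10.1, -11.5, -10.4],
)
adv = grpo_advantages(rewards)
```

Under this sketch, rollouts that apply the concept correctly receive positive advantage and the others negative, so the policy gradient pushes toward concept-faithful trajectories while the penalty discourages drifting far from the reference model.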
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10456