Beyond Pattern Matching: Teaching Language Models to Plan, Sequence, and Execute Reasoning Primitives
Keywords: Cognitive Reasoning, Chain-of-Thought, Reinforcement Learning, On-Policy Self-Distillation
TL;DR: ReasonPlan trains language models to reason more reliably by decomposing chains of thought into explicit cognitive primitives and rewarding not just correct answers, but the quality of the reasoning process itself.
Abstract: Recent reinforcement learning (RL) methods have improved language models' chain-of-thought performance, but strong benchmark results often reflect pattern imitation rather than genuine reasoning. We introduce ReasonPlan, a framework that represents chain-of-thought traces as structured reasoning programs composed of interpretable cognitive primitives, enabling supervision over primitive diagnosis, execution, and sequencing. Starting from correct NuminaMath traces, we mine reasoning programs using a taxonomy of 12 cognitive primitives and identify which primitives are diagnostically important, execution-critical, or order-sensitive by contrasting correct and incorrect solutions to the same problem. We then design four primitive-aware RL strategies: multi-reward GRPO, rubric-based LLM judging, on-policy self-distillation with primitive guidelines, and hierarchical primitive token injection. We evaluate ReasonPlan on 104 Reasoning Gym tasks spanning mathematics, coding, logic, and commonsense with Qwen3-4B-Instruct. Beyond pass@1 and pass@8, we introduce primitive-level metrics and interpretability analyses to measure more transferable, cognitively grounded reasoning.
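The abstract does not spell out the contrastive mining step, so the following is a minimal Python sketch of one way to identify diagnostically important primitives by contrasting correct and incorrect traces of the same problem, assuming traces have already been annotated with primitive labels. The primitive names, function names, and scoring rule here are all illustrative assumptions, not the paper's actual taxonomy or procedure.

```python
from collections import Counter

# Hypothetical primitive labels; the paper's 12 primitives are not
# enumerated in the abstract, so these names are placeholders.
PRIMITIVES = [
    "decompose", "case_split", "verify", "backtrack",
    "generalize", "substitute", "estimate", "recall_fact",
]

def primitive_counts(trace: list[str]) -> Counter:
    """Count how often each primitive label appears in one annotated trace."""
    return Counter(step for step in trace if step in PRIMITIVES)

def diagnostic_scores(correct_traces, incorrect_traces):
    """Score each primitive by how much more often it occurs per trace
    in correct vs. incorrect solutions to the same problem."""
    pos, neg = Counter(), Counter()
    for trace in correct_traces:
        pos.update(primitive_counts(trace))
    for trace in incorrect_traces:
        neg.update(primitive_counts(trace))
    n_pos = max(len(correct_traces), 1)
    n_neg = max(len(incorrect_traces), 1)
    return {p: pos[p] / n_pos - neg[p] / n_neg for p in PRIMITIVES}

# Toy example: "verify" appears mainly in correct traces, so it scores
# highest, i.e. it is flagged as diagnostically important.
correct = [["decompose", "substitute", "verify"], ["decompose", "verify"]]
incorrect = [["decompose", "substitute"], ["recall_fact", "substitute"]]
print(sorted(diagnostic_scores(correct, incorrect).items(),
             key=lambda kv: -kv[1]))
```

A per-trace frequency difference is only one plausible contrast statistic; the paper may instead use position-sensitive or judge-based criteria to separate execution-critical and order-sensitive primitives.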
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 177