Keywords: Mathematical Reasoning, Curriculum Learning, Reinforcement Learning, Efficiency
TL;DR: A tailored curriculum reinforcement learning framework with a hybrid Thinking/NoThinking strategy improves accuracy and substantially reduces computation on standard math reasoning benchmarks.
Abstract: Large Language Models (LLMs) have recently demonstrated remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT reasoning typically requires large-scale reinforcement learning (RL) training and often leads to overthinking with redundant reasoning steps. To improve learning and reasoning efficiency while preserving or even enhancing performance, we propose TACLer, a tailored curriculum reinforcement learning framework that gradually increases data complexity based on the model's proficiency across multi-stage RL training. Our framework features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; and (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50\% compared to long-thinking models and inference token usage by over 42\% relative to the base model; and (ii) it improves accuracy by over 9\% over the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines on four math datasets with complex problems.
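The abstract describes a staged curriculum that advances with model proficiency and a per-problem Thinking/NoThinking toggle. The sketch below is a minimal, hypothetical illustration of that idea only; the names (`CurriculumScheduler`, `proficiency_threshold`, `choose_mode`) and the accuracy-threshold stage-advancement rule are assumptions, not the paper's actual training pipeline.

```python
# Illustrative sketch (not TACLer's implementation): a difficulty-staged curriculum
# that advances when the model is proficient at the current stage, plus a simple
# Thinking/NoThinking selector. All thresholds and names are hypothetical.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Problem:
    question: str
    difficulty: int  # e.g., 1 (easy) .. 3 (hard)


@dataclass
class CurriculumScheduler:
    problems: List[Problem]
    proficiency_threshold: float = 0.7  # assumed: advance once stage accuracy exceeds this
    stage: int = 1
    history: List[float] = field(default_factory=list)

    def current_batch(self) -> List[Problem]:
        # Train only on problems at or below the current difficulty stage.
        return [p for p in self.problems if p.difficulty <= self.stage]

    def update(self, stage_accuracy: float) -> None:
        # Gradually increase data complexity once the model handles the
        # current stage well (a real pipeline would use RL rewards/rollouts).
        self.history.append(stage_accuracy)
        if stage_accuracy >= self.proficiency_threshold:
            self.stage += 1


def choose_mode(problem: Problem, stage: int) -> str:
    # Hybrid paradigm: spend long "Thinking" only on problems at or above the
    # current stage; answer easier ones directly in "NoThinking" mode.
    return "thinking" if problem.difficulty >= stage else "nothinking"


if __name__ == "__main__":
    data = [Problem(f"q{i}", difficulty=1 + i % 3) for i in range(9)]
    sched = CurriculumScheduler(data)
    for epoch in range(3):
        batch = sched.current_batch()
        modes = [choose_mode(p, sched.stage) for p in batch]
        # Fake a rising accuracy just to show the stage progression.
        sched.update(stage_accuracy=0.6 + 0.1 * epoch)
        print(f"epoch={epoch} stage={sched.stage} batch={len(batch)} modes={modes}")
```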
Primary Area: generative models
Submission Number: 17497