Rethinking Reasoning Quality in Large Language Models through Enhanced Chain-of-Thought via RL

19 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Reinforcement Learning for LLMs, Reasoning, Reductive Logic, Benchmark
Abstract: Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical and programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Moreover, such task-specific training offers limited control over logical depth and may therefore fail to reveal a model's genuine reasoning capacity. We propose the **D**ynamic **R**easoning **E**fficiency **R**eward (**DRER**), a plug-and-play RL reward framework that reshapes both the reward and the advantage signals. (i) A **Reasoning Quality Reward** assigns fine-grained credit to reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising trajectories whose CoT tokens are genuinely beneficial. (ii) A **Dynamic Length Advantage** decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release LogicTree, a dynamically constructed deductive reasoning dataset that serves both as RL training data and as a comprehensive benchmark. Experiments show that DRER delivers significant improvements in reasoning accuracy and CoT quality over baseline methods across diverse training settings, while also reducing token usage at inference time. It further generalizes well to reasoning and mathematical benchmarks such as GPQA and AIME24. These results indicate that DRER, as a plug-and-play fine-grained RL reward framework, reliably strengthens reasoning behavior and provides a practical pathway toward enhancing the reasoning capabilities of LLMs. All code and data are available in our anonymous repository: https://anonymous.4open.science/r/DRER-D34E.
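
For concreteness, the minimal sketch below illustrates one way the two DRER components described in the abstract could be wired into an RL reward pipeline. The function names, the probability-gain formulation of the Reasoning Quality Reward, and the exponential decay used for the Dynamic Length Advantage are illustrative assumptions, not the paper's exact definitions.

```python
import math


def reasoning_quality_reward(p_answer_with_cot: float,
                             p_answer_without_cot: float,
                             is_correct: bool,
                             scale: float = 1.0) -> float:
    """Bonus credit for CoTs that demonstrably raise the likelihood of the
    correct answer. Both probabilities are assumed to be measured under the
    policy model (with and without conditioning on the generated CoT).
    Incorrect answers get no bonus, so this stays an additive term on top of
    the usual rule-based format/correctness reward."""
    if not is_correct:
        return 0.0
    gain = p_answer_with_cot - p_answer_without_cot
    return scale * max(gain, 0.0)  # only reward chains that actually help


def dynamic_length_advantage(advantage: float,
                             response_length: int,
                             length_threshold: int,
                             decay_rate: float = 1e-3) -> float:
    """Decay the advantage of responses whose length deviates from a
    validation-derived threshold (assumed exponential decay), damping
    degenerate very long or very short CoTs without flipping the sign."""
    deviation = abs(response_length - length_threshold)
    return advantage * math.exp(-decay_rate * deviation)


# Hypothetical usage: combine with a rule-based correctness reward of 1.0.
reward = 1.0 + reasoning_quality_reward(0.82, 0.55, is_correct=True)
shaped_advantage = dynamic_length_advantage(advantage=reward - 1.2,
                                            response_length=1800,
                                            length_threshold=1200)
```

In this reading, the quality term shapes the per-trajectory reward while the length term rescales the advantage used by the policy-gradient update; the threshold and decay rate would be tuned on a validation split.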
Primary Area: reinforcement learning
Submission Number: 16364