Keywords: Reinforcement Learning for LLMs, Reasoning, Reductive Logic, Benchmark
Abstract: Reinforcement learning (RL) has recently become the dominant paradigm for strengthening the reasoning abilities of large language models (LLMs). Yet the rule-based reward functions commonly used on mathematical and programming benchmarks assess only answer format and correctness, providing no signal as to whether the induced Chain-of-Thought (CoT) actually improves the answer. Moreover, such task-specific training offers limited control over logical depth and may therefore fail to reveal a model's genuine reasoning capacity. We propose **D**ynamic **R**easoning **E**fficiency **R**eward (**DRER**), a plug-and-play RL reward framework that reshapes both the reward and advantage signals. (i) A **Reasoning Quality Reward** assigns fine-grained credit to reasoning chains that demonstrably raise the likelihood of the correct answer, directly incentivising trajectories whose CoT tokens are beneficial. (ii) A **Dynamic Length Advantage** decays the advantage of responses whose length deviates from a validation-derived threshold, stabilising training. To facilitate rigorous assessment, we also release LogicTree, a dynamically constructed deductive reasoning dataset that serves both as RL training data and as a comprehensive benchmark. Experiments show significant improvements in inference accuracy and logical consistency over baseline methods at equal training steps, while the average confidence of CoT-augmented answers rises by 30%. The model further generalises to diverse logical-reasoning datasets and to the mathematical benchmark AIME24. These results illuminate how RL shapes CoT behaviour and chart a practical path toward enhancing the formal-reasoning skills of large language models. All code and data are available in our anonymous repository: https://anonymous.4open.science/r/DRER-D34E.
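To make the two components concrete, below is a minimal Python sketch of how they could be computed. The abstract specifies only the high-level design, so the function names, the positive-gain credit rule, and the exponential decay form are illustrative assumptions, not the authors' exact formulation.

```python
import math

def reasoning_quality_reward(logp_answer_with_cot: float,
                             logp_answer_without_cot: float,
                             scale: float = 1.0) -> float:
    """Credit a reasoning chain by how much it raises the (log-)likelihood of
    the correct answer relative to answering without the chain.
    The functional form is an assumption; the paper only states that chains
    which demonstrably raise answer likelihood receive fine-grained credit."""
    gain = logp_answer_with_cot - logp_answer_without_cot
    return scale * max(gain, 0.0)  # credit only demonstrably beneficial chains

def dynamic_length_advantage(advantage: float,
                             response_len: int,
                             length_threshold: int,
                             decay_rate: float = 0.01) -> float:
    """Decay the advantage of a response whose length deviates from a
    validation-derived threshold (hypothetical exponential decay)."""
    deviation = abs(response_len - length_threshold)
    return advantage * math.exp(-decay_rate * deviation)

# Example: a chain that raises answer log-likelihood by 0.7 nats, produced in
# a 480-token response against a 400-token validation-derived threshold.
r = reasoning_quality_reward(-0.5, -1.2)
a = dynamic_length_advantage(advantage=r, response_len=480, length_threshold=400)
print(f"reward={r:.3f}, length-adjusted advantage={a:.3f}")
```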
Primary Area: reinforcement learning
Submission Number: 16364