Keywords: reinforcement learning, large language models, reasoning, uncertainty
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent and contextually relevant text. These models
arguably lack the ability to reason logically, an essential skill for solving mathematical problems and programming tasks.
While step-by-step prompting approaches show some promise, they often depend on finding a suitable prompt tailored to the specific model and task. In this work, we propose a simple yet effective approach to enhance reasoning capabilities by leveraging reinforcement learning (RL) and the confidence scores of a well-calibrated LLM. It involves optimising an implicit reward derived from the model's confidence in its answer to the reasoning task at hand.
We generate preference data and fine-tune the LLM in a similar spirit to reinforcement learning from human feedback (RLHF), but without needing any human-provided labels or preferences.
Our results show that the reasoning abilities of the resulting LLM improve and transfer to other reasoning tasks. This warrants further investigation of RL as a facilitator for solving complex language tasks.
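A minimal sketch of the confidence-based preference-pair construction the abstract describes, assuming a Hugging Face-style causal LM and using mean answer-token log-probability as a stand-in confidence score; the function names and the choice of confidence proxy are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch: build preference pairs from a model's own confidence
# scores; the resulting pairs could then feed a DPO/RLHF-style fine-tuning step.
import torch
import torch.nn.functional as F


def answer_confidence(model, tokenizer, prompt: str, answer: str) -> float:
    """Mean log-probability the model assigns to the answer tokens,
    used here as a simple proxy for a calibrated confidence score."""
    full = tokenizer(prompt + answer, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits
    # Shift so that token t is predicted from positions < t.
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    targets = full["input_ids"][:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    answer_lp = token_lp[:, prompt_len - 1:]  # keep only the answer span
    return answer_lp.mean().item()


def build_preference_pair(model, tokenizer, prompt: str, sampled_answers):
    """Rank sampled answers by confidence; the most and least confident
    become the 'chosen' and 'rejected' members of a preference pair."""
    scored = sorted(
        sampled_answers,
        key=lambda a: answer_confidence(model, tokenizer, prompt, a),
    )
    return {"prompt": prompt, "rejected": scored[0], "chosen": scored[-1]}
```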
Primary Area: reinforcement learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10605