From Words To Rewards: Leveraging Natural Language For Reinforcement Learning

Published: 21 Jan 2026, Last Modified: 21 Jan 2026. Accepted by TMLR. License: CC BY 4.0
Abstract: We explore the use of natural language to specify rewards in Reinforcement Learning from Human Feedback (RLHF). Unlike traditional approaches that rely on simplistic preference feedback, we harness Large Language Models (LLMs) to translate rich text feedback into state-level labels for training a reward model. Our empirical studies with human participants demonstrate that our method accurately approximates the reward function and achieves significant performance gains with fewer interactions than baseline methods.
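The pipeline the abstract describes (LLM turns free-form text feedback into state-level labels, which then supervise a reward model) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `llm_to_state_labels` stands in for an actual LLM call with a trivial keyword heuristic, and the "reward model" is just a per-state label average; all names and the data are hypothetical.

```python
# Hypothetical sketch: translate text feedback into state-level labels
# (the role the paper assigns to an LLM), then fit a toy reward model.

def llm_to_state_labels(feedback: str, states: list) -> dict:
    """Stand-in for an LLM query: map one piece of text feedback to a
    label for each state it refers to. A real system would prompt an
    LLM; here a crude keyword heuristic plays that role."""
    label = 1 if "good" in feedback.lower() else -1
    return {s: label for s in states}

def fit_reward_model(labeled: dict) -> dict:
    """Toy reward model: the mean label observed for each state."""
    return {s: sum(labels) / len(labels) for s, labels in labeled.items()}

# Aggregate labels across several pieces of feedback on trajectories.
data = {}
for feedback, states in [
    ("That was a good move near the goal", ["s_goal", "s_mid"]),
    ("Bad: the agent fell off the ledge", ["s_ledge"]),
]:
    for s, lab in llm_to_state_labels(feedback, states).items():
        data.setdefault(s, []).append(lab)

reward_model = fit_reward_model(data)
```

In the actual method, the learned reward model would then be used to train a policy, replacing the pairwise preference labels of standard RLHF with richer, state-level supervision derived from language.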
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Changed wording of 4th contribution.
Code: https://github.com/BelenMU/WordsToRewards
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 6283