From Words To Rewards: Leveraging Natural Language For Reinforcement Learning

TMLR Paper 6283 Authors

22 Oct 2025 (modified: 02 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: We explore the use of natural language to specify rewards in Reinforcement Learning from Human Feedback (RLHF). Unlike traditional approaches that rely on simplistic preference feedback, we harness Large Language Models (LLMs) to translate rich text feedback into state-level labels for training a reward model. Our empirical studies with human participants demonstrate that our method accurately approximates the reward function and achieves significant performance gains with fewer interactions than baseline methods.
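The pipeline described in the abstract — an LLM converting free-form text feedback into per-state labels, which then supervise a reward model — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Transition`, `llm_label`, `label_trajectory`, and `fit_reward_model` names are hypothetical, and the keyword-matching heuristic inside `llm_label` is a stand-in for an actual LLM call.

```python
from dataclasses import dataclass
from typing import List, Tuple

import numpy as np


@dataclass
class Transition:
    state: Tuple[float, ...]  # numeric state features
    description: str          # textual rendering of the state shown to the LLM


def llm_label(feedback: str, transition: Transition) -> float:
    """Placeholder for an LLM call that scores how well a state matches the
    human's text feedback. A real system would prompt an LLM with the feedback
    and the state description and parse a scalar label from its response."""
    # Toy heuristic standing in for the LLM: +1 if any feedback word appears
    # in the state description, -1 otherwise.
    keywords = set(feedback.lower().split())
    words = set(transition.description.lower().split())
    return 1.0 if keywords & words else -1.0


def label_trajectory(feedback: str, trajectory: List[Transition]) -> List[float]:
    """Turn one piece of free-form text feedback into state-level labels."""
    return [llm_label(feedback, t) for t in trajectory]


def fit_reward_model(states, labels) -> np.ndarray:
    """Fit a simple linear reward model to the LLM-produced labels
    via least squares (a stand-in for whatever model the paper trains)."""
    X = np.asarray(states, dtype=float)
    y = np.asarray(labels, dtype=float)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


# Tiny worked example: two states, one piece of text feedback.
traj = [
    Transition((1.0, 0.0), "robot moves toward the goal"),
    Transition((0.0, 1.0), "robot wanders away"),
]
labels = label_trajectory("good: moving toward the goal", traj)
w = fit_reward_model([t.state for t in traj], labels)
```

The labeled states (`labels`) then serve as supervision for the reward model `w`, which an RL agent could optimize against in place of hand-crafted rewards.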
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We revised the manuscript to address reviewer R64J's comments as well. Changes appear in blue.
Assigned Action Editor: ~Vimal_Thilak2
Submission Number: 6283