From Words to Rewards: Leveraging Natural Language for Reinforcement Learning

Published: 12 Jun 2025, Last Modified: 21 Jun 2025
Venue: EXAIT@ICML 2025 Poster
License: CC BY 4.0
Track: Language Modeling
Keywords: Deep Reinforcement Learning, Reward Modeling, Human Feedback, Reward Attribution, Preference-Based Reinforcement Learning
Abstract: We explore the use of natural language for specifying rewards in Reinforcement Learning from Human Feedback (RLHF). Human language provides rich and nuanced information, yet most existing approaches rely on simplistic preference data or constrain the text structure. In contrast, we use Large Language Models (LLMs) to exploit free-form natural language feedback for efficiently training a reward model. Our empirical studies with human participants highlight the benefits of this strategy: even with minimal human interaction, integrating text feedback through LLMs accurately approximates the reward function and leads to significant performance gains.
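To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how free-form text feedback might be mapped to a scalar reward with an LLM. The prompt format, the reward range, and the `llm` callable are all illustrative assumptions; in practice the resulting labels would supervise a learned reward model rather than be used directly.

```python
from typing import Callable


def text_feedback_to_reward(
    trajectory_summary: str,
    feedback: str,
    llm: Callable[[str], str],
) -> float:
    """Ask an LLM to map free-form human feedback to a scalar reward in [-1, 1].

    `llm` is a hypothetical text-in/text-out interface to any hosted or local
    language model; this is a sketch, not the paper's actual pipeline.
    """
    prompt = (
        "You are labeling reinforcement-learning trajectories.\n"
        f"Trajectory: {trajectory_summary}\n"
        f"Human feedback: {feedback}\n"
        "On a scale from -1 (very bad) to 1 (very good), output only a number."
    )
    raw = llm(prompt)
    try:
        # Clamp to the valid reward range in case the model overshoots.
        return max(-1.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # Fall back to a neutral reward if the output is unparsable.


if __name__ == "__main__":
    # Stub LLM for demonstration; a real system would query an actual model.
    stub_llm = lambda prompt: "0.8"
    reward = text_feedback_to_reward(
        "Agent reached the goal but knocked over a vase.",
        "Good job getting there, but please be more careful with objects.",
        stub_llm,
    )
    print(reward)  # 0.8
```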
Serve As Reviewer: ~Belen_Martin_Urcelay1
Submission Number: 12