From Words to Rewards: Leveraging Natural Language for Reinforcement Learning

Published: 12 Jun 2025, Last Modified: 21 Jun 2025
Venue: EXAIT@ICML 2025 Poster
License: CC BY 4.0
Track: Language Modeling
Keywords: Deep Reinforcement Learning, Reward Modeling, Human Feedback, Reward Attribution, Preference-Based Reinforcement Learning
Abstract: We explore the use of natural language for specifying rewards in Reinforcement Learning from Human Feedback (RLHF). Human language provides rich and nuanced information, yet most existing approaches rely on simplistic preference data or constrain the text structure. In contrast, we use Large Language Models (LLMs) to exploit free-form natural language feedback for efficiently training a reward model. Our empirical studies with human participants highlight the benefits of this strategy: even with minimal human interaction, integrating text feedback through LLMs accurately approximates the reward function and leads to significant performance gains.
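To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how free-form text feedback might be mapped to a scalar reward with an LLM. The prompt format, the reward range, and the `llm` callable are all illustrative assumptions; in practice the resulting labels would supervise a learned reward model rather than be used directly.

```python
from typing import Callable


def text_feedback_to_reward(
    trajectory_summary: str,
    feedback: str,
    llm: Callable[[str], str],
) -> float:
    """Ask an LLM to map free-form human feedback to a scalar reward in [-1, 1].

    `llm` is a hypothetical text-in/text-out interface to any hosted or local
    language model; this is a sketch, not the paper's actual pipeline.
    """
    prompt = (
        "You are labeling reinforcement-learning trajectories.\n"
        f"Trajectory: {trajectory_summary}\n"
        f"Human feedback: {feedback}\n"
        "On a scale from -1 (very bad) to 1 (very good), output only a number."
    )
    raw = llm(prompt)
    try:
        # Clamp to the valid reward range in case the model overshoots.
        return max(-1.0, min(1.0, float(raw.strip())))
    except ValueError:
        return 0.0  # Fall back to a neutral reward if the output is unparsable.


if __name__ == "__main__":
    # Stub LLM for demonstration; a real system would query an actual model.
    stub_llm = lambda prompt: "0.8"
    reward = text_feedback_to_reward(
        "Agent reached the goal but knocked over a vase.",
        "Good job getting there, but please be more careful with objects.",
        stub_llm,
    )
    print(reward)  # 0.8
```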
Serve As Reviewer: ~Belen_Martin_Urcelay1
Submission Number: 12