Self-correcting Reward Shaping via Language Models for Reinforcement Learning Agents in Games

Published: 20 Jun 2025 · Last Modified: 22 Jul 2025 · RLVG Workshop, RLC 2025 · CC BY 4.0
Keywords: Reinforcement Learning, Language Models, Preference-Based, Self-Correction, Reward Design, Video Games
TL;DR: We propose an approach that uses Language Models to automatically design rewards for Reinforcement Learning agents.
Abstract: Reinforcement Learning (RL) in games has gained significant momentum in recent years, enabling the creation of different agent behaviors that can transform a player's gaming experience. However, deploying RL agents in production environments still presents two key challenges: (1) designing an effective reward function typically requires an RL expert, and (2) when a game's content or mechanics are modified, previously tuned reward weights may no longer be optimal. To address the latter challenge, we propose an automated approach for iteratively fine-tuning an RL agent's reward function weights, based on a user-defined, language-based behavioral goal. A Language Model (LM) proposes updated weights at each iteration, given this target behavior and a summary of performance statistics from prior training rounds. This closed-loop process allows the LM to self-correct and refine its output over time, producing increasingly aligned behavior without the need for manual reward engineering. We evaluate our approach on a racing task and show that it consistently improves agent performance across iterations. The LM-guided agents show a significant increase in performance, from a $9\%$ to a $74\%$ success rate, in just one iteration. Moreover, we compare our LM-guided tuning against a human expert's manual weight design on the racing task: by the final iteration, the LM-tuned agent achieved an $80\%$ success rate and completed laps in an average of $855$ time steps, competitive with the expert-tuned agent's peak of $94\%$ success and $850$ time steps.
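To make the closed-loop process described in the abstract concrete, below is a minimal sketch of how such an LM-in-the-loop weight-tuning cycle could be wired together. All names here (`query_lm`, `train_agent`, `tune`, and the specific reward-weight keys) are hypothetical placeholders for illustration, not the authors' implementation; the LM call and the RL training run are stubbed out so the loop is runnable end to end.

```python
# Sketch of a closed-loop reward-weight tuning cycle: an LM proposes weights,
# an agent is trained with them, and summary statistics are fed back to the LM.
# All function names and reward terms are hypothetical placeholders.
import json
import random


def query_lm(goal: str, history: list[dict]) -> dict:
    """Placeholder for a Language Model call that proposes reward weights.

    A real implementation would prompt an LM with the behavioral goal and a
    summary of prior training rounds, then parse its structured response.
    """
    # Stub: perturb the last proposal, or start from a default weight set.
    last = history[-1]["weights"] if history else {
        "progress": 1.0, "speed": 1.0, "collision": -1.0,
    }
    return {k: round(v * random.uniform(0.8, 1.2), 3) for k, v in last.items()}


def train_agent(weights: dict) -> dict:
    """Placeholder for an RL training run under the given reward weights.

    Returns summary statistics for the trained agent.
    """
    # Stub: fabricate statistics so the loop can be executed as written.
    return {
        "success_rate": round(random.random(), 2),
        "mean_lap_steps": random.randint(800, 1200),
    }


def tune(goal: str, iterations: int = 3) -> list[dict]:
    """Closed loop: propose weights -> train agent -> feed stats back."""
    history: list[dict] = []
    for i in range(iterations):
        weights = query_lm(goal, history)
        stats = train_agent(weights)
        history.append({"iteration": i, "weights": weights, "stats": stats})
        print(json.dumps(history[-1]))
    return history


if __name__ == "__main__":
    tune("Complete laps as fast as possible while avoiding collisions")
```

In this sketch, the feedback signal to the LM is simply the full history of proposed weights and resulting statistics; a practical system would summarize that history into a compact prompt and constrain the LM's output format so the proposed weights can be parsed reliably.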
Submission Number: 5