R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design. However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, leading to low optimization efficiency. To address these challenges, we propose an efficient automated reward design framework, called R*, which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a population of reward functions and modularizes their functional components. LLMs serve as the mutation operator, and a module-level crossover is proposed to facilitate efficient exploration and exploitation. To obtain more effective reward parameters, R* first leverages LLMs to generate multiple critic functions for trajectory comparison and annotation. Based on these critics, a voting mechanism collects trajectory segments with high-confidence labels. These labeled segments are then used to refine the reward function parameters through preference learning. Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
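The module-level crossover mentioned in the abstract can be illustrated with a minimal sketch. The names below (RewardModule, RewardFunction, module_crossover) and the weighted-sum reward form are illustrative assumptions, not the paper's actual implementation: each reward function is treated as a set of named reward-term modules with tunable weights, and crossover recombines whole modules from two parents.

```python
# Minimal sketch of module-level crossover between two modular reward functions.
# All names and data structures here are illustrative assumptions, not the
# authors' implementation.
import random
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class RewardModule:
    name: str                      # e.g. "distance_to_goal", "action_penalty"
    term: Callable[[dict], float]  # maps an environment state dict to a scalar term
    weight: float                  # tunable parameter exposed to alignment optimization

@dataclass
class RewardFunction:
    modules: Dict[str, RewardModule] = field(default_factory=dict)

    def __call__(self, state: dict) -> float:
        # Assumed weighted-sum composition of the reward terms.
        return sum(m.weight * m.term(state) for m in self.modules.values())

def module_crossover(parent_a: RewardFunction, parent_b: RewardFunction,
                     rng: random.Random) -> RewardFunction:
    """Recombine two parents at the granularity of whole reward modules."""
    child = RewardFunction()
    for name in set(parent_a.modules) | set(parent_b.modules):
        if name in parent_a.modules and name in parent_b.modules:
            # Shared module: inherit one parent's version at random.
            child.modules[name] = rng.choice([parent_a.modules[name],
                                              parent_b.modules[name]])
        else:
            # Module unique to one parent: inherit it with some probability,
            # reusing useful code blocks while preserving diversity.
            source = parent_a if name in parent_a.modules else parent_b
            if rng.random() < 0.5:
                child.modules[name] = source.modules[name]
    return child
```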
Lay Summary: High-quality reward functions are a prerequisite for stable and efficient reinforcement learning, yet crafting them manually is labor-intensive and error-prone. Recent attempts to let large language models (LLMs) write rewards still struggle because naïve searches over the vast design space converge slowly and often miss good parameter choices. We introduce R*, an automated framework that separates reward design into two coordinated steps: reward-structure evolution and parameter-alignment optimization. First, a population of modular reward functions is evolved with LLM-driven mutation and module-level crossover, reusing useful code blocks while encouraging diversity. Second, multiple LLM-generated critic functions compare short trajectory segments; a voting scheme retains only high-confidence labels, making parameter tuning data-efficient and fully automatic. Alignment operates on segments where at least three of five critics agree, ensuring reliable supervision without human intervention. Across eight robotic-manipulation benchmarks from Isaac Gym and Dexterity, the rewards produced by R* let agents learn faster and achieve higher final success rates than Eureka, the previous state of the art. Because the entire loop, from structure search to parameter optimization and critic labeling, runs automatically, R* turns reward shaping from an expert art into a repeatable pipeline. These advances could shorten the path from research code to reliable factory or household robots that learn new tasks safely and quickly.
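The critic-voting step can also be sketched in a few lines. The sketch below assumes each LLM-generated critic is a callable that compares two trajectory segments and returns which one it prefers; the 3-of-5 agreement threshold follows the lay summary, while the segment representation and function names are illustrative assumptions.

```python
# Minimal sketch of critic voting over trajectory-segment pairs. Each critic is
# assumed to return 0 or 1 for the preferred segment; pairs without at least
# `min_agreement` matching votes are discarded as low-confidence.
from typing import Callable, List, Sequence, Tuple

Segment = List[dict]                        # a short sequence of state dicts
Critic = Callable[[Segment, Segment], int]  # returns 0 or 1 (index of preferred segment)

def collect_high_confidence_labels(
    pairs: Sequence[Tuple[Segment, Segment]],
    critics: Sequence[Critic],
    min_agreement: int = 3,
) -> List[Tuple[Segment, Segment, int]]:
    """Keep only segment pairs where enough critics agree on the preference."""
    labeled = []
    for seg_a, seg_b in pairs:
        votes = [critic(seg_a, seg_b) for critic in critics]
        if votes.count(0) >= min_agreement:
            labeled.append((seg_a, seg_b, 0))
        elif votes.count(1) >= min_agreement:
            labeled.append((seg_a, seg_b, 1))
        # otherwise: no high-confidence label, pair is dropped
    return labeled
```

The retained labels would then drive the preference-learning step described in the abstract, for example by fitting the reward-module weights with a pairwise preference loss; the paper's exact objective is not specified on this page.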
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reward Design, Reinforcement Learning
Submission Number: 12111