R*: Efficient Reward Design via Reward Structure Evolution and Parameter Alignment Optimization with Large Language Models
Abstract: Reward functions are crucial for policy learning. Large Language Models (LLMs), with strong coding capabilities and valuable domain knowledge, provide an automated solution for high-quality reward design.
However, code-based reward functions require precise guiding logic and parameter configurations within a vast design space, which makes automated reward optimization inefficient. To address these challenges,
we propose an efficient automated reward design framework, called R*,
which decomposes reward design into two parts: reward structure evolution and parameter alignment optimization. To design high-quality reward structures, R* maintains a reward function population and modularizes the functional components. LLMs are employed as the mutation operator, and module-level crossover is proposed to facilitate efficient exploration and exploitation.
To obtain more effective reward parameters, R* first leverages LLMs to generate multiple critic functions that compare and annotate trajectories. Based on these critics, a voting mechanism collects trajectory segments with high-confidence labels.
These labeled segments are then used to refine the reward function parameters through preference learning.
Experiments on diverse robotic control tasks demonstrate that R* outperforms strong baselines in both reward design efficiency and quality, surpassing human-designed reward functions.
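A minimal Python sketch of the structure-evolution step described in the abstract, assuming each reward function is represented as a dictionary of named code modules; `llm_mutate` (LLM-based code rewriting) and `fitness` (policy-training performance of a candidate reward) are hypothetical placeholders, not the authors' actual interface.

```python
# Minimal sketch of reward structure evolution via LLM mutation and
# module-level crossover. `llm_mutate` and `fitness` are hypothetical
# callables, assumed here for illustration only.
import random

def module_level_crossover(parent_a: dict, parent_b: dict) -> dict:
    """Recombine two modular reward functions: each named component
    (e.g. 'distance_term', 'action_penalty') is inherited from one
    parent chosen at random."""
    child = {}
    for name in set(parent_a) | set(parent_b):
        donors = [p for p in (parent_a, parent_b) if name in p]
        child[name] = random.choice(donors)[name]
    return child

def evolve_structures(population, fitness, llm_mutate,
                      generations=5, elite_frac=0.5):
    """Keep the best-performing reward structures, recombine them
    module-wise, and let the LLM mutate the offspring's code."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elites = ranked[: max(2, int(elite_frac * len(ranked)))]
        offspring = []
        while len(elites) + len(offspring) < len(population):
            a, b = random.sample(elites, 2)
            offspring.append(llm_mutate(module_level_crossover(a, b)))
        population = elites + offspring
    return max(population, key=fitness)
```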
Lay Summary: High-quality reward functions are a prerequisite for stable and efficient reinforcement learning, yet crafting them manually is labor-intensive and error-prone. Recent attempts to let large language models (LLMs) write rewards still struggle, because naïve searches over the vast design space converge slowly and often miss good parameter choices.
We introduce R*, an automated framework that separates reward design into two coordinated steps: reward-structure evolution and parameter-alignment optimization. First, a population of modular reward functions is evolved with LLM-driven mutation and module-level crossover, reusing useful code blocks while encouraging diversity. Second, multiple LLM-generated critic functions compare short trajectory segments; a voting scheme retains only high-confidence labels, making parameter tuning data-efficient and fully automatic. Alignment operates on segments where at least three of five critics agree, ensuring reliable supervision without human intervention. Across eight robotic-manipulation benchmarks from Isaac Gym and Dexterity, the rewards produced by R* let agents learn faster and reach higher final success rates than Eureka, the previous state of the art. Because the entire loop, from structure search to parameter optimization and critic labelling, runs automatically, R* turns reward shaping from an expert art into a repeatable pipeline.
These advances could shorten the path from research code to reliable factory or household robots that learn new tasks safely and quickly.
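A similarly rough sketch of the parameter-alignment step (critic voting followed by preference learning); the critic interface, segment representation, and Bradley-Terry-style loss below are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative sketch: filter trajectory-segment pairs by critic voting,
# then refine reward parameters with a Bradley-Terry-style preference loss.
# Critic interface and segment format are assumptions for exposition.
import torch

def vote(critics, seg_a, seg_b, min_agreement=3):
    """Each LLM-generated critic returns 'a', 'b', or None (abstain).
    Keep the pair only if enough critics agree on the same winner."""
    votes = [c(seg_a, seg_b) for c in critics]
    n_a, n_b = votes.count('a'), votes.count('b')
    if max(n_a, n_b) >= min_agreement and n_a != n_b:
        return 1.0 if n_a > n_b else 0.0   # high-confidence preference label
    return None                            # discard ambiguous pairs

def preference_loss(reward_fn, labeled_pairs):
    """The parameterized reward should assign a higher return to the
    segment the critics preferred. `reward_fn` is assumed to return a
    tensor of per-step rewards for a segment."""
    losses = []
    for seg_a, seg_b, label in labeled_pairs:
        logit = reward_fn(seg_a).sum() - reward_fn(seg_b).sum()
        target = torch.as_tensor(label, dtype=logit.dtype)
        losses.append(
            torch.nn.functional.binary_cross_entropy_with_logits(logit, target))
    return torch.stack(losses).mean()
```

The resulting loss would then be minimized with a standard optimizer over the reward function's tunable parameters.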
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reward Design, Reinforcement Learning
Submission Number: 12111