Keywords: Reinforcement Learning, Large Language Model, Meta-learning, Self-evolution
TL;DR: We propose a bi-level differentiable evolutionary framework that discovers interpretable reward functions for RL training of LLMs.
Abstract: The design of reward functions presents a persistent challenge in reinforcement learning (RL). Existing automated reward modeling typically relies on derivative-free evolutionary heuristics that treat the reward function as a black box, failing to capture the causal relationship between structural changes and task performance. To bridge this gap, we propose $\textbf{Differentiable Evolutionary Reinforcement Learning (DERL)}$, a bi-level training framework for the autonomous discovery of optimal reward signals. In DERL, a $\textit{Meta-Optimizer}$ evolves a reward function by composing structured atomic primitives, guiding the evolution of the inner-loop policy. Crucially, DERL is differentiable at the meta level: the Meta-Optimizer is updated via policy gradients derived from inner-loop validation performance. This allows it to progressively learn the ``meta-gradient'' of task success, yielding denser and more actionable feedback. We validate DERL across three distinct domains: embodied agents (ALFWorld), scientific simulation (ScienceWorld), and mathematical reasoning (GSM8K, MATH). Results show that DERL achieves state-of-the-art performance on agent benchmarks, significantly outperforming non-differentiable methods, especially in out-of-distribution scenarios. Analysis of the evolutionary trajectory demonstrates that DERL successfully captures the intrinsic structure of tasks, enabling self-improving agent alignment without human intervention.
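The bi-level loop described in the abstract can be illustrated with a minimal sketch. The primitive set, scoring function, and hyperparameters below are hypothetical stand-ins (the abstract does not specify them); the sketch only shows the structure: an outer Meta-Optimizer maintains a distribution over atomic reward primitives and is updated with a REINFORCE-style policy gradient on inner-loop validation performance.

```python
import math
import random

random.seed(0)

# Hypothetical atomic reward primitives; the paper's actual primitive
# set is not given in the abstract.
PRIMITIVES = ["progress", "sparse", "shaped"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def inner_loop_validation(primitive):
    # Stand-in for training the inner-loop policy under the sampled
    # reward function and measuring validation success. Here the
    # "progress" primitive is, by construction, the best signal.
    base = {"progress": 0.9, "sparse": 0.4, "shaped": 0.6}[primitive]
    return base + random.uniform(-0.05, 0.05)

def meta_optimize(steps=300, lr=0.3):
    # Meta-Optimizer: logits over primitives, updated by policy gradient
    # on inner-loop validation score (with an EMA baseline).
    logits = [0.0] * len(PRIMITIVES)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        idx = random.choices(range(len(PRIMITIVES)), weights=probs)[0]
        score = inner_loop_validation(PRIMITIVES[idx])
        baseline = 0.9 * baseline + 0.1 * score
        adv = score - baseline
        # REINFORCE gradient of log pi(idx) w.r.t. each logit.
        for j in range(len(PRIMITIVES)):
            grad = (1.0 if j == idx else 0.0) - probs[j]
            logits[j] += lr * adv * grad
    return PRIMITIVES[max(range(len(PRIMITIVES)), key=lambda j: logits[j])]

best = meta_optimize()
print(best)
```

In the full method the Meta-Optimizer composes primitives into structured reward functions rather than selecting a single one, but the gradient flow from validation performance back to the meta level follows the same pattern.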
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40