TL;DR: We introduce a new RLHF training method in which language models adaptively prioritize and create useful prompts during training, guided by reward signals.
Abstract: Existing reinforcement learning (RL) methods for large language models (LLMs) rely on static prompt sets: prompts are curated a priori and sampled on a fixed schedule for training, regardless of their usefulness to the RL process. We design `eva`, the first method that allows LLMs to prioritize and adaptively create useful prompts during RL training, guided by reward signals. In principle, `eva` (Evolving via Asymmetric Self-Play) casts language model training as a game between: (1) a creator, who samples and generates training prompts, and (2) a solver, who generates responses to the prompts. `eva` is simple, suits both offline and online RL for LLMs, and sets a new state of the art on challenging benchmarks without extra human prompts: it improves gemma-2-9b-it’s win-rate on Arena-Hard from 51.6% to 60.1% with DPO and from 52.6% to 62.4% with RLOO, surpassing claude-3-opus and nearing gemini-1.5-pro, both of which are orders of magnitude larger. Further ablation studies show that `eva` can induce a meaningful learning curriculum and effectively scale RL for LLMs beyond static human prompts.
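To make the creator/solver loop concrete, below is a minimal Python sketch of one `eva`-style round, assuming a prioritize-then-evolve scheme. Every function name here is a hypothetical stub, and the informativeness score (the spread of rewards across sampled responses) is our assumption about one way "reward signals" could drive prompt prioritization; the paper's actual estimator and training loop may differ.

```python
import random

# Hypothetical stand-ins for the solver policy, reward model, and creator;
# names and signatures are illustrative, not taken from the paper.
def solver_generate(prompt, n=4):
    """Sample n candidate responses from the solver policy (stubbed)."""
    return [f"response-{i} to: {prompt}" for i in range(n)]

def reward(prompt, response):
    """Score a (prompt, response) pair with a reward model (stubbed)."""
    return random.random()

def creator_evolve(prompt):
    """Ask the creator to write a new variant of a useful prompt (stubbed)."""
    return f"harder variant of: {prompt}"

def informativeness(prompt):
    """One plausible reward-signal proxy (an assumption, not quoted from the
    abstract): the spread of rewards across sampled responses. A wide spread
    suggests the prompt still separates good from bad behavior, so it is
    worth keeping and evolving."""
    rewards = [reward(prompt, r) for r in solver_generate(prompt)]
    return max(rewards) - min(rewards)

def eva_iteration(prompt_pool, keep_frac=0.5, n_new_per_prompt=1):
    """One creator step: keep the most informative prompts, evolve new ones."""
    scored = sorted(prompt_pool, key=informativeness, reverse=True)
    kept = scored[: max(1, int(len(scored) * keep_frac))]
    evolved = [creator_evolve(p) for p in kept for _ in range(n_new_per_prompt)]
    return kept + evolved  # next round's training prompts for the solver

if __name__ == "__main__":
    pool = ["summarize this article", "prove this identity", "write a haiku"]
    for step in range(3):
        pool = eva_iteration(pool)
        # ... the solver would be optimized on `pool` here, e.g. via DPO or RLOO
        print(f"step {step}: {len(pool)} prompts")
```

In this sketch the solver's RL update (DPO or RLOO, per the abstract) happens between creator steps, which is where the offline/online distinction mentioned above would enter.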
Lay Summary: Large language models are usually trained using a fixed set of prompts. But not all prompts are equally useful, and this static setup limits how much the model can improve.
Our method, called `eva`, lets the model not only respond to prompts but also create new ones based on which ones help it learn best. This turns training into a kind of game: one part of the model proposes problems, and another part tries to solve them well.
By using feedback from rewards, `eva` learns which prompts are most helpful, evolves better ones, and gradually builds a smarter training path. This leads to significantly better results, even outperforming models that are much larger or trained with more human data.
We show that by letting models generate and select their own training data, they can improve more efficiently. `eva` is simple to use, works in both offline and online training settings, and is a step toward more adaptive and general-purpose AI systems.
Primary Area: Deep Learning->Large Language Models
Keywords: large language models
Submission Number: 8003