EVOLVING ROLLOUTS: Harnessing Historical Experience for Web Agent Evolution in Reinforcement Learning

Sinuo Wang; WANG PIAOHONG; Tianrui Qin; Maojia Song; Qianben Chen; Qiexiang Wang; Gengze Zhou; Zeyu Zhang; He Zhu; Dingfeng Shi; Yutong Xie; Minghao Liu; Jiaheng Liu; Ge Zhang; Jiawei Ma; Yuchen Eleanor Jiang; Qi Wu; Wangchunshu Zhou

Published: 30 Apr 2026, Last Modified: 24 Jun 2026

Venue: ICML 2026 regular

Readers: Everyone

License: CC BY 4.0

Abstract: Agentic reinforcement learning (RL) for web search is prohibitively expensive due to long context lengths and costly environment interactions, and this inefficiency is further exacerbated by group-based optimization, which discards learning signals from entire rollout groups with zero reward variance. In this work, we propose EVOLVING ROLLOUTS, an RL framework for web-search agents that moves beyond episodic training and distills collected rollouts into in-context guidance for future policy behavior. By extracting the reward-labeled trajectories into strategic experiences, our method augments standard parameter-space optimization with implicit context-space optimization guided by prior experience. This enables the agent to recover learning signals from zero-variance rollouts, thereby fostering co-evolution between the policy and the experience repository. EVOLVING ROLLOUTS improves sample efficiency and task performance across representative web search benchmarks, with Qwen3-8B surpassing the much larger Qwen3-30B-A3B in average performance across GAIA, xBench, and HLE, and Qwen3-4B attaining comparable results on GAIA and HLE.
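Illustration: the zero-variance problem described in the abstract can be made concrete with a minimal sketch. It assumes a GRPO-style group-normalized advantage, (reward - group mean) / (group std + eps); the abstract says only "group-based optimization", so the exact normalization, the function names `group_advantages` and `has_learning_signal`, and the toy reward values are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps).

    Assumed GRPO-style normalization, used here only for illustration.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """True if at least one rollout in the group gets a nonzero advantage."""
    return any(a != 0.0 for a in group_advantages(rewards))


# Hypothetical binary rewards for three groups of four rollouts each.
mixed = [1.0, 0.0, 0.0, 1.0]     # some rollouts succeed, some fail
all_fail = [0.0, 0.0, 0.0, 0.0]  # zero reward variance
all_pass = [1.0, 1.0, 1.0, 1.0]  # zero reward variance

for name, group in [("mixed", mixed), ("all_fail", all_fail), ("all_pass", all_pass)]:
    print(name, has_learning_signal(group))
```

Under this assumption, only the mixed group yields nonzero advantages. The all-fail and all-pass groups contribute no gradient at all, which is the learning signal the abstract says is recovered by distilling such rollouts into in-context experience.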

Lay Summary: AI assistants are increasingly trained to browse the web and answer hard research questions, but teaching them this way is expensive. The standard training method has a blind spot: when the AI tries a task several times and either always succeeds or always fails, the algorithm discards those attempts as having nothing to teach, wasting most of the costly practice. Our method, Evolving Rollouts, fixes this by giving the agent a growing experience repository alongside its usual training. After every batch of attempts, we summarize what worked and what didn't, and save those insights as reusable strategies that the agent can look up next time.

Originally Submitted Supplementary Material: gz

Primary Area: Applications->Language, Speech and Dialog

Keywords: Web Agent, Reinforcement Learning, Self-evolving Agent

Originally Submitted PDF: pdf

Submission Number: 15393
