Internalizing World Models via Self-Play Finetuning for Agentic RL

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
Venue: MSLD 2026 Poster
License: CC BY 4.0
Keywords: World model, Agentic RL
Abstract: Agentic reinforcement learning (RL) has emerged as a primary framework for fine-tuning LLM agents (Wang et al., 2025b; Yao et al., 2023). However, agent performance degrades sharply in out-of-distribution (OOD) environments that lie beyond the model's pretraining data. We identify a systematic divergence between Pass@1, the success rate of a single sampled trajectory, and Pass@k, the probability that at least one of k sampled trajectories succeeds. In OOD environments such as Sokoban, FrozenLake, and Sudoku, standard RL consistently reduces Pass@k while only marginally improving Pass@1. This pattern suggests that current agentic RL tends to develop narrow, exploitative policies rather than acquiring the broader world knowledge needed for robust generalization.
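For reference, a minimal sketch of how Pass@k curves like the ones the abstract describes are commonly estimated, using the standard unbiased estimator of Chen et al. (2021); the estimator choice and the rollout counts below are illustrative assumptions, not numbers from the paper.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k trajectories
    drawn without replacement from n rollouts (c successful) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical illustration of the Pass@1 / Pass@k divergence:
# 10 of 100 sampled trajectories solve a held-out Sokoban level.
print(f"Pass@1  = {pass_at_k(100, 10, 1):.2f}")   # 0.10
print(f"Pass@16 = {pass_at_k(100, 10, 16):.2f}")  # 0.84
# After exploitative RL, Pass@1 can rise while Pass@k falls if the
# remaining successes collapse onto one near-duplicate strategy.

The gap between the two numbers is exactly what the abstract highlights: a policy can slightly improve its single-sample success rate while shrinking the diversity that makes Pass@k high.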
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112