Internalizing World Models via Self-Play Finetuning for Agentic RL

Published: 28 Apr 2026, Last Modified: 28 Apr 2026
Venue: MSLD 2026 Poster
License: CC BY 4.0
Keywords: World model, Agentic RL
Abstract: Agentic reinforcement learning (RL) has emerged as a primary framework for fine-tuning LLM agents (Wang et al., 2025b; Yao et al., 2023). However, agent performance degrades sharply in out-of-distribution (OOD) environments that lie beyond the model's pretraining data. We identify a systematic divergence between Pass@1, the success rate of a single sampled trajectory, and Pass@k, the probability that at least one of k sampled trajectories succeeds. In OOD environments such as Sokoban, FrozenLake, and Sudoku, standard RL consistently reduces Pass@k while only marginally improving Pass@1. This pattern suggests that current agentic RL tends to develop narrow, exploitative policies rather than acquiring the broader world knowledge needed for robust generalization.
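For reference, a minimal sketch of how Pass@k curves like the ones the abstract describes are commonly estimated, using the standard unbiased estimator of Chen et al. (2021); the estimator choice and the rollout counts below are illustrative assumptions, not numbers from the paper.

import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k trajectories
    drawn without replacement from n rollouts (c successful) succeeds."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a success
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical illustration of the Pass@1 / Pass@k divergence:
# 10 of 100 sampled trajectories solve a held-out Sokoban level.
print(f"Pass@1  = {pass_at_k(100, 10, 1):.2f}")   # 0.10
print(f"Pass@16 = {pass_at_k(100, 10, 16):.2f}")  # 0.84
# After exploitative RL, Pass@1 can rise while Pass@k falls if the
# remaining successes collapse onto one near-duplicate strategy.

The gap between the two numbers is exactly what the abstract highlights: a policy can slightly improve its single-sample success rate while shrinking the diversity that makes Pass@k high.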
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 112