Keywords: LLM Agents, Multi-turn Reinforcement Learning
Abstract: Large Language Models (LLMs) hold promise as autonomous agents but remain limited on long-horizon, sparse-reward tasks, where achieving goals requires extended planning and precise action sequences. These challenges arise on two fronts: algorithmically, sparse feedback destabilizes reinforcement learning; at the systems level, variance in rollout lengths causes severe GPU underutilization. Asynchronous training improves efficiency but introduces off-policy data, which can destabilize reinforcement learning with LLMs. We propose Verlog, a framework for efficient multi-turn RL with LLM agents. Verlog reduces rollout variance through early truncation and per-turn asynchronous rollouts, while stabilizing training with a dual-discounted GAE and a pretrained value function. We provide the first systematic analysis of the "off-policy tax" in asynchronous training frameworks, quantifying when policy staleness undermines performance. On the BabyAI, BabaIsAI, and Crafter benchmarks, Verlog delivers substantial improvements in both computational throughput and task success rates, remaining stable and efficient on trajectories exceeding 400 turns, where prior frameworks typically destabilize beyond 10 turns.
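A note on the "dual-discounted GAE" named above: the abstract does not define it, so the following is a minimal sketch of one plausible formulation, assuming a token-level discount \(\gamma_{\text{tok}}\) within a turn and a turn-level discount \(\gamma_{\text{turn}}\) at turn boundaries. All symbols here are our own notation, not the paper's.

% Sketch (assumed notation): TD error with a position-dependent discount,
% gamma_turn at turn boundaries and gamma_tok within a turn.
\[
  \delta_t = r_t + \gamma_t \, V(s_{t+1}) - V(s_t),
  \qquad
  \gamma_t =
  \begin{cases}
    \gamma_{\text{turn}} & \text{if token } t \text{ ends a turn},\\[2pt]
    \gamma_{\text{tok}}  & \text{otherwise},
  \end{cases}
\]
\[
  \hat{A}_t = \sum_{l \ge 0} \lambda^{l} \Big( \prod_{k=0}^{l-1} \gamma_{t+k} \Big) \, \delta_{t+l}.
\]
% Reduces to standard GAE(gamma, lambda) when gamma_turn = gamma_tok = gamma.

Under this reading, the within-turn discount keeps token-level credit assignment dense while the turn-level discount controls how sparse end-of-episode reward propagates across turns; when the two discounts coincide, the estimator reduces to standard GAE.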
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22873