Keywords: LLM Agents, Multi-turn Reinforcement Learning
Abstract: Large Language Models (LLMs) hold promise as autonomous agents but remain limited on long-horizon, sparse-reward tasks, where achieving goals requires extended planning and precise action sequences. These challenges arise on two fronts: algorithmically, sparse feedback destabilizes reinforcement learning; at the systems level, variance in rollout lengths causes severe GPU underutilization. Asynchronous training improves efficiency but introduces off-policy data, which can destabilize reinforcement learning with LLMs. We propose Verlog, a framework for efficient multi-turn RL with LLM agents. Verlog reduces rollout variance through early truncation and per-turn asynchronous rollouts, while stabilizing training with a dual-discounted GAE and a pretrained value function. We provide the first systematic analysis of the "off-policy tax" in asynchronous training frameworks, quantifying when policy staleness undermines performance. On the BabyAI, BabaIsAI, and Crafter benchmarks, Verlog delivers substantial improvements in both computational throughput and task success rates, remaining stable and efficient on trajectories exceeding 400 turns, where prior frameworks typically destabilize beyond 10 turns.
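A note on the "dual-discounted GAE" named above: the abstract does not define it, so the following is a minimal sketch of one plausible formulation, assuming a token-level discount \(\gamma_{\text{tok}}\) within a turn and a turn-level discount \(\gamma_{\text{turn}}\) at turn boundaries. All symbols here are our own notation, not the paper's.

% Sketch (assumed notation): TD error with a position-dependent discount,
% gamma_turn at turn boundaries and gamma_tok within a turn.
\[
  \delta_t = r_t + \gamma_t \, V(s_{t+1}) - V(s_t),
  \qquad
  \gamma_t =
  \begin{cases}
    \gamma_{\text{turn}} & \text{if token } t \text{ ends a turn},\\[2pt]
    \gamma_{\text{tok}}  & \text{otherwise},
  \end{cases}
\]
\[
  \hat{A}_t = \sum_{l \ge 0} \lambda^{l} \Big( \prod_{k=0}^{l-1} \gamma_{t+k} \Big) \, \delta_{t+l}.
\]
% Reduces to standard GAE(gamma, lambda) when gamma_turn = gamma_tok = gamma.

Under this reading, the within-turn discount keeps token-level credit assignment dense while the turn-level discount controls how sparse end-of-episode reward propagates across turns; when the two discounts coincide, the estimator reduces to standard GAE.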
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 22873