Keywords: LLM Agents, Multi-turn Reinforcement Learning, Long-Horizon
Abstract: Finetuning large language model (LLM) agents with multi-turn reinforcement learning (RL) is a promising direction. However, applying multi-turn RL to long-horizon agentic tasks presents unique challenges not typically encountered in reasoning tasks such as solving math problems. These include long interaction histories that hinder relevant context retrieval, sparse rewards that slow down learning, and variable trajectory lengths that reduce training efficiency. To address these challenges, we propose Verlog, a framework that incorporates:
(1) a customizable agent memory mechanism, allowing the agent to flexibly include different lengths of historical interaction in each turn's prompt based on task requirements;
(2) dual-discounting GAE, which decouples step-level and token-level credit assignment (see the sketch after this list); and
(3) early trajectory truncation, which reduces GPU idle time and improves multi-turn RL training efficiency.
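As an informal illustration of what item (2) could mean in practice, the sketch below computes advantages token by token, applying step-level discount factors only when crossing a turn boundary and token-level factors within a turn. All function names, arguments, and default values are hypothetical and are not taken from the Verlog codebase.

```python
import numpy as np

def dual_discount_gae(rewards, values, turn_ids,
                      gamma_step=0.99, lam_step=0.95,
                      gamma_token=1.0, lam_token=1.0):
    """Backward GAE pass over a flat token sequence (hypothetical sketch).

    Uses (gamma_step, lam_step) when the next token belongs to a new turn
    and (gamma_token, lam_token) between tokens of the same turn.
    rewards:  per-token rewards, shape (T,)
    values:   per-token value estimates plus a bootstrap value, shape (T + 1,)
    turn_ids: non-decreasing turn index per token, shape (T,)
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        # Step-level discounting applies only when crossing a turn boundary.
        crosses_turn = t + 1 < T and turn_ids[t + 1] != turn_ids[t]
        gamma = gamma_step if crosses_turn else gamma_token
        lam = lam_step if crosses_turn else lam_token
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns
```

Under this reading, setting gamma_token = lam_token = 1.0 lets credit flow undiscounted across tokens within a turn, while the step-level factors alone govern credit assignment across turns.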
Experiments demonstrate that our method surpasses the zero-shot performance of state-of-the-art LLMs across three benchmarks (BabyAI, BabaIsAI, and Crafter), while also achieving greater efficiency and effectiveness than variants lacking either the memory mechanism or dual-discounting GAE. Notably, Verlog is the first framework capable of training LLM agents on trajectories exceeding 400 turns, demonstrating scalability far beyond prior approaches.
Submission Number: 131