CRPS: Curriculum Replay via Progressive Suffixes from Successful Trajectories for Long-Horizon LLM Agents

Published: 05 Apr 2026, Last Modified: 23 Apr 2026 | ACL 2026 Findings | Everyone | CC BY 4.0
Abstract: Long-horizon LLM agents trained with sparse terminal rewards often learn slowly and unstably, and the problem is amplified by the group-normalized on-policy objectives commonly used for LLM training (e.g., GRPO). When rollout groups collapse to nearly all failures early in training, within-group normalization yields degenerate advantages and weak learning signals. To address this, we propose Curriculum Replay via Progressive Suffixes from Successful Trajectories (CRPS), a lightweight RL training strategy that turns serendipitous terminal successes into a within-trajectory curriculum. CRPS maintains a buffer of successful trajectories and restarts rollouts from states along their suffixes, while an online controller adapts the suffix length k to match agent competence and keep replay outcomes informative. Across ALFWorld and WebShop with different foundation models, CRPS consistently outperforms full-episode GRPO and naive experience replay. Group-level diagnostics further show that CRPS reduces the ratio of degenerate groups and increases within-group outcome diversity, consistent with faster and more stable training.
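To illustrate why an all-failure rollout group gives no learning signal under group normalization, the following is a minimal sketch, not the paper's implementation. It assumes binary terminal rewards and the common mean/standard-deviation normalization used in GRPO-style objectives; the function names `group_normalized_advantages` and `degenerate_group_ratio`, the `eps` constant, and the definition of a degenerate group as one whose rewards are all identical are assumptions made for illustration.

```python
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one rollout group (assumed GRPO-style form)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


def degenerate_group_ratio(groups):
    """Fraction of groups whose rewards are all identical (assumed definition)."""
    degenerate = sum(1 for g in groups if len(set(g)) <= 1)
    return degenerate / len(groups)


# Hypothetical groups of four rollouts with binary terminal rewards.
all_fail = [0.0, 0.0, 0.0, 0.0]  # every rollout failed
mixed = [0.0, 1.0, 0.0, 1.0]     # half succeeded

adv_fail = group_normalized_advantages(all_fail)    # every advantage is zero
adv_mixed = group_normalized_advantages(mixed)      # roughly -1 and +1
ratio = degenerate_group_ratio([all_fail, mixed])   # one of two groups is degenerate
```

In the all-failure group every advantage is zero, so the policy gradient contribution from that group vanishes; the mixed group produces advantages of opposite sign that distinguish successful from failed rollouts.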