Keywords: Reinforcement Learning, Large Language Models, Policy Optimization, Importance Sampling
TL;DR: ECHO introduces coarse-grained, batch-level clipping of importance sampling ratios that stabilizes RL training for LLMs while achieving stronger final performance on reasoning tasks.
Abstract: Reinforcement learning (RL) for large language models (LLMs) typically employs token-level clipping of importance sampling ratios to ensure training stability. While effective at preventing catastrophic policy shifts, such fine-grained clipping often excessively truncates learning signals, limiting optimization efficiency.
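For reference, the standard token-level clipped surrogate used in PPO/GRPO-style LLM training takes roughly the following form (notation here is ours, not the paper's):

$$
\mathcal{J}(\theta) = \mathbb{E}\!\left[ \frac{1}{|o|}\sum_{t=1}^{|o|} \min\!\Big( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big) \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t})}
$$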
To address this limitation, we propose ECHO, a novel RL method that combines batch-level clipping with token-level importance sampling. Specifically, ECHO computes an average importance sampling ratio across the entire batch, clips it once with a single batch-level bound, and uses the resulting ratio to modulate the gradient of each token. This batch-level approach preserves richer global reward information while retaining fine-grained token attribution: gradients capture a more holistic reward structure, sample efficiency improves, and training converges faster and more stably. Our method also offers a new perspective on how to define importance sampling ratios and reward shaping in RL for LLMs.
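As a rough illustration only (the paper's exact loss is not given here), one possible reading of the batch-level clipping idea is sketched below in PyTorch; the function name, clipping range, masking, and normalization choices are all assumptions, not the authors' implementation.

```python
import torch

def echo_surrogate_loss(logp_new, logp_old, advantages, mask, clip_eps=0.2):
    """Illustrative batch-level-clipped surrogate (a sketch, not ECHO's exact loss).

    logp_new, logp_old: (B, T) token log-probs under the current / behaviour policy
    advantages:         (B, T) per-token advantage estimates
    mask:               (B, T) 1 for response tokens, 0 for prompt/padding
    """
    # Per-token importance sampling ratios (zeroed out on non-response tokens).
    ratios = torch.exp(logp_new - logp_old) * mask

    # Single batch-level ratio: average over all valid tokens in the batch.
    batch_ratio = ratios.sum() / mask.sum().clamp_min(1.0)

    # Clip once at the batch level instead of per token.
    clipped_batch_ratio = batch_ratio.clamp(1.0 - clip_eps, 1.0 + clip_eps)

    # Pessimistic (min) PPO-style surrogate, but driven by the shared batch-level
    # ratio; per-token advantages keep fine-grained credit assignment.
    surrogate = torch.min(batch_ratio * advantages, clipped_batch_ratio * advantages)
    return -(surrogate * mask).sum() / mask.sum().clamp_min(1.0)
```

In an actual training loop this sketch would stand in for the per-token clipped surrogate above; how ECHO precisely combines the batch-level bound with token-level ratios is specified in the paper, not here.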
Experimental results on in-domain math and reasoning benchmarks demonstrate that ECHO not only accelerates convergence but also achieves highly competitive performance, highlighting its efficiency and robustness for large-scale LLM alignment.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 10653