Abstract: We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines
but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization
barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates
these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker
and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for
every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or
tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally
enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO
converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted
on degenerate groups. Ablation studies confirm that SPO’s gains stem from its principled approach to
baseline estimation and advantage normalization, offering a more robust and efficient path for LLM
reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32
by +3.4 percentage points (pp) over GRPO, driven by substantial absolute gains on challenging
datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25, and it achieves
consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing
trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles,
not architectural workarounds, drive the next wave of progress in LLM reasoning.
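The abstract names two core mechanisms: a persistent, KL-adaptive value tracker used as a per-prompt baseline, and advantage normalization performed globally over the batch rather than within groups. The sketch below is one plausible reading of how these pieces could fit together, written in Python for illustration only; the class names, the KL-to-step-size mapping, and the exact update rule are assumptions, not the paper's implementation.

```python
# Hedged sketch (not the authors' code): a single-stream advantage computation
# with a persistent per-prompt baseline and batch-level normalization.
# The tracker update rule and the KL-adaptive step size `beta` are assumptions.
import numpy as np


class ValueTracker:
    """Persistent per-prompt baseline, nudged toward newly observed rewards.

    `beta` plays the role of a KL-adaptive step size: when the policy has
    drifted far from the one that produced past estimates (large KL), the
    tracker trusts new rewards more. The KL-to-beta mapping is a placeholder.
    """

    def __init__(self, init_value: float = 0.0):
        self.values = {}          # prompt_id -> running baseline estimate
        self.init_value = init_value

    def baseline(self, prompt_id: str) -> float:
        return self.values.get(prompt_id, self.init_value)

    def update(self, prompt_id: str, reward: float, kl_to_old_policy: float) -> None:
        # Assumed rule: step size grows with policy drift, capped at 1.
        beta = min(1.0, 0.1 + kl_to_old_policy)
        v = self.baseline(prompt_id)
        self.values[prompt_id] = (1.0 - beta) * v + beta * reward


def global_normalized_advantages(rewards, baselines, eps: float = 1e-8):
    """Advantage = reward - persistent baseline, normalized over the whole
    batch (not per group), so no sample's signal collapses when all
    responses to a single prompt happen to share the same reward."""
    adv = np.asarray(rewards, dtype=np.float64) - np.asarray(baselines, dtype=np.float64)
    return (adv - adv.mean()) / (adv.std() + eps)


# Toy usage: one rollout per prompt (single stream), batch-level normalization.
tracker = ValueTracker()
batch = [("p1", 1.0, 0.02), ("p2", 0.0, 0.10), ("p3", 1.0, 0.05)]  # (prompt, reward, KL)
baselines = [tracker.baseline(pid) for pid, _, _ in batch]
advantages = global_normalized_advantages([r for _, r, _ in batch], baselines)
for pid, reward, kl in batch:
    tracker.update(pid, reward, kl)
```

Because the baseline is persistent rather than recomputed per group, this sketch also suggests how prioritized sampling could follow naturally: prompts whose tracked value indicates the most remaining headroom can be sampled more often, though the selection criterion here would again be an assumption.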