Group Distributionally Robust Optimization-Driven RL for LLM Reasoning

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · MALGAI · License: CC BY 4.0
Keywords: Large Language Models, Reasoning Models, Reinforcement Learning, Distributionally Robust Optimization, GRPO
Abstract: Reasoning post-training with GRPO is typically built on *static uniformity*: uniform prompt sampling and a fixed number of rollouts per prompt. For heterogeneous, heavy-tailed reasoning data, this wastes compute on already-solved patterns while under-training the long tail of hard problems. We cast GRPO post-training as two independent (uncoupled) group distributionally robust optimization (GDRO) games over *dynamic difficulty groups* defined by pass@k evaluation: a *data adversary* that reshapes prompt sampling and a *compute adversary* that redistributes rollouts. Prompt-GDRO applies multiplicative-weights reweighting over difficulty bins (with an EMA-debiased difficulty score) to upweight persistently hard groups without frequency bias. Rollout-GDRO allocates rollouts across bins under a fixed mean budget via a shadow-price controller, improving gradient information efficiency on high-uncertainty groups while remaining compute-neutral. Our approach is principled and theory-driven: we provide no-regret guarantees for the Prompt-GDRO game (via an entropy-regularized GDRO surrogate) and a variance-proxy analysis that yields a square-root-optimal compute allocation for Rollout-GDRO. On DAPO 14.1k with Qwen3-Base (1.7B/4B/8B), each controller improves pass@8 by 9–13% over GRPO, and diagnostics reveal an emergent curriculum that tracks the evolving reasoning frontier.
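The two controllers described in the abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the bin structure, the EMA debiasing, the multiplicative-weights step size `eta`, and the square-root allocation rule are assumptions based only on the mechanisms the abstract names (multiplicative weights over difficulty bins, and a variance-proxy-driven, compute-neutral rollout budget).

```python
import numpy as np

def ema_update(ema, observed, beta=0.9):
    """Debias a noisy per-bin difficulty estimate (e.g. 1 - pass@k)
    with an exponential moving average."""
    return beta * ema + (1.0 - beta) * observed

def prompt_gdro_weights(weights, difficulty, eta=0.1):
    """One multiplicative-weights step: bins with persistently high
    (EMA-debiased) difficulty get upweighted; renormalize to a
    distribution over bins for prompt sampling."""
    w = weights * np.exp(eta * difficulty)
    return w / w.sum()

def rollout_allocation(variance_proxy, mean_budget):
    """Square-root allocation: rollouts per bin proportional to the
    square root of a per-bin variance proxy, rescaled so the average
    rollout count stays at `mean_budget` (compute-neutral up to
    integer rounding)."""
    raw = np.sqrt(variance_proxy)
    n = raw / raw.mean() * mean_budget
    return np.maximum(1, np.round(n)).astype(int)

# Toy usage: three difficulty bins, uniform initial sampling weights.
weights = np.full(3, 1.0 / 3.0)
difficulty = np.array([0.1, 0.5, 0.9])   # hypothetical EMA failure rates
weights = prompt_gdro_weights(weights, difficulty)

rollouts = rollout_allocation(np.array([0.04, 0.16, 0.64]), mean_budget=8)
```

The square-root rule mirrors the standard result that, to minimize the variance of a stratified estimate under a fixed total sample budget, per-group sample counts should scale with per-group standard deviation; the abstract's variance-proxy analysis plays that role here.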
Submission Number: 59