Execution-Grounded Credit Assignment for GRPO in Code Generation

Published: 03 Mar 2026, Last Modified: 17 Mar 2026 · SPOT · CC BY 4.0
Keywords: Post-Training, RLVR, GRPO, RL
TL;DR: EGCA localizes GRPO credit assignment to the earliest execution divergence between a generated program and a reference trace, replacing uniform sequence-level updates with span-targeted gradients for near-correct code.
Abstract: Critic-free reinforcement learning with verifiable rewards (RLVR) improves code generation by optimizing unit-test pass rates, but GRPO-style updates suffer from coarse credit assignment: a single outcome signal is spread uniformly across long programs even when failure stems from a localized semantic error. We propose \textbf{Execution-Grounded Credit Assignment (EGCA)}, which localizes GRPO updates using execution traces. For programs that satisfy algorithmic constraints but fail tests, EGCA executes the candidate and a canonical reference solution (curated once offline; used for analysis, not supervision) under identical instrumentation, identifies the earliest semantic divergence, and assigns advantage only to the corresponding token span while masking downstream tokens. EGCA is a drop-in modification requiring no critic, auxiliary loss, or learned verifier, yielding 82.1\% pass@1 on HumanEval (+3.1 over GRPO) and 68.9\% on MBPP (+1.5) with 18\% wall-clock overhead.
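The divergence-localization step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes per-step local-variable snapshots as the "identical instrumentation," uses `sys.settrace` to record them, and defines hypothetical helpers (`collect_trace`, `earliest_divergence`, `span_mask`) plus toy `reference`/`candidate` functions that share variable names so their states are directly comparable.

```python
import sys

def collect_trace(fn, *args):
    """Record a snapshot of local variables at every executed line of fn."""
    trace = []
    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is fn.__code__:
            trace.append(dict(frame.f_locals))  # state before this line runs
        return tracer
    old = sys.gettrace()
    sys.settrace(tracer)
    try:
        fn(*args)
    finally:
        sys.settrace(old)
    return trace

def earliest_divergence(candidate_trace, reference_trace):
    """Index of the first step whose variable state differs, or None."""
    for i, (c, r) in enumerate(zip(candidate_trace, reference_trace)):
        if c != r:
            return i
    if len(candidate_trace) != len(reference_trace):
        return min(len(candidate_trace), len(reference_trace))
    return None

def span_mask(num_tokens, start, end):
    """1.0 on the blamed token span, 0.0 elsewhere (downstream masked)."""
    return [1.0 if start <= t < end else 0.0 for t in range(num_tokens)]

# Toy example: a correct reference and a candidate with an off-by-one bug.
def reference(n):
    total = 0
    for i in range(n):
        total += i
    return total

def candidate(n):
    total = 0
    for i in range(n):
        total += i + 1  # bug: off by one per iteration
    return total

div_step = earliest_divergence(collect_trace(candidate, 3),
                               collect_trace(reference, 3))
```

Here `div_step` pinpoints the first instrumented step where the candidate's state departs from the reference; in EGCA that step would be mapped back to the generated token span receiving the advantage, while `span_mask` zeroes the update on all downstream tokens.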
Submission Number: 48