Keywords: Code LLM, Reinforcement Learning
TL;DR: We optimize pass@1 as the sole objective by turning coding-standard diagnostics into per-rule rewards (no tests, no preference pairs) with a frequency-aware schedule, yielding strong pass@1 gains on a frozen CodeContests+ subset.
Abstract: Large language models for code often pass unit tests yet remain brittle in practice: they may overfit to a test suite, rely on undefined semantics, or fail under small perturbations. We use the coding standard as training guidance and keep unit tests outside the training loop. Each rule provides a machine-checkable outcome that we convert into per-rule reward components with a simple frequency-aware schedule. The only optimization target is higher pass@1 (single-attempt functional success).
We present CodeRule-RL, a reinforcement learning approach that optimizes pass@1 as the sole objective and uses coding-standard feedback only as auxiliary guidance. Rule outcomes are converted into per-rule reward components, and a simple frequency-aware curriculum prioritizes the rules that are violated most often and reduces their weight as compliance improves. The model, optimizer, data, and prompts remain fixed; training adjusts only the reward weights. Unit tests may appear in prompts to express specifications, but they are not executed during training.
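To make the frequency-aware weighting concrete, the following is a minimal sketch of how per-rule violation frequencies could be turned into decaying reward weights. This is an assumption-based illustration, not the paper's released code: the class name, method names, and the exponential-moving-average update are hypothetical choices consistent with the description above.

```python
class RuleRewardSchedule:
    """Sketch of a frequency-aware schedule: rules violated often get larger
    reward weight, and a rule's weight shrinks as compliance improves."""

    def __init__(self, rules, ema=0.99, floor=0.05):
        self.rules = list(rules)
        self.ema = ema      # smoothing for the running violation rate (assumed value)
        self.floor = floor  # minimum weight so no rule is ignored entirely
        # Start pessimistic: assume every rule is violated until observed otherwise.
        self.violation_rate = {r: 1.0 for r in self.rules}

    def observe(self, violated_rules):
        """Update running violation rates from static diagnostics on one sample."""
        for r in self.rules:
            hit = 1.0 if r in violated_rules else 0.0
            self.violation_rate[r] = self.ema * self.violation_rate[r] + (1 - self.ema) * hit

    def weights(self):
        """Normalize violation rates into per-rule reward weights, with a floor."""
        raw = {r: max(self.floor, v) for r, v in self.violation_rate.items()}
        total = sum(raw.values())
        return {r: v / total for r, v in raw.items()}

    def rule_reward(self, violated_rules):
        """Auxiliary reward in [0, 1]: weighted fraction of rules satisfied."""
        w = self.weights()
        return sum(w[r] for r in self.rules if r not in violated_rules)


# Hypothetical usage: rule identifiers and violations come from a static checker.
sched = RuleRewardSchedule(rules=["RULE-9.1", "RULE-21.3"])
sched.observe(violated_rules={"RULE-9.1"})
aux_reward = sched.rule_reward(violated_rules={"RULE-9.1"})
```

In this sketch the auxiliary rule reward would be combined with the pass@1 objective during RL; how the two are mixed is not specified here and is left to the paper.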
On the public CodeContests+ C subset, CodeRule-RL attains higher pass@1 while reducing training wall-clock time by more than an order of magnitude compared with RL that executes tests during training. Across 1.5B–7B backbones, it consistently improves functional success, with a relative pass@1 gain of 87%.
Primary Area: reinforcement learning
Submission Number: 12991