Keywords: Code LLM, Reinforcement Learning
TL;DR: We optimize pass@1 as the sole objective by turning coding-standard diagnostics into per-rule rewards (no tests, no preference pairs), using a frequency-aware schedule that yields strong pass@1 gains on a frozen CodeContests+ subset.
Abstract: Large language models for code often pass unit tests yet remain brittle in practice: they may overfit to a test suite, rely on undefined semantics, or fail under small perturbations. Existing RL-based code generation methods optimize rewards from unit test execution, tightly coupling the training signal to a specific test suite. In contrast, we focus on coding standards as the primary source of feedback. We use rules (e.g., MISRA C) as machine-checkable outcomes, converting them into per-rule reward components with a frequency-aware schedule.
During reinforcement learning, the model maximizes this rule-based proxy reward. We hypothesize that enforcing well-established coding rules provides a generalizable training signal that improves both adherence to standards and pass@1 (single-attempt functional success). Motivated by these observations, we present CodeRule-RL, a reinforcement learning approach that integrates coding rules directly as reward signals for code generation. A frequency-aware curriculum prioritizes frequently violated rules and downweights them as compliance improves. The model, optimizer, data, and prompts remain fixed; training adjusts only the reward weights. Unit tests may appear in prompts as specifications, but they are never executed during training. On the public CodeContests+ C subset, CodeRule-RL achieves higher pass@1 while reducing training wall-clock time by more than an order of magnitude compared with RL that executes tests during training. Across 1.5B–7B backbones, it consistently improves both coding-standard compliance and functional success, delivering an 87% relative pass@1 gain.
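To make the frequency-aware schedule concrete, the following is a minimal sketch of how per-rule diagnostics might be converted into a scalar reward: rules are weighted by their relative violation frequency, so frequently violated rules dominate early and are naturally downweighted as compliance improves. All names and formulas here are illustrative assumptions, not the paper's implementation.

```python
def rule_weights(violation_counts, temperature=1.0):
    """Weight each rule by its share of observed violations.

    As training reduces a rule's violation count, its weight shrinks,
    implementing the 'downweight as compliance improves' behavior.
    `temperature` (an assumed knob) sharpens or flattens the schedule.
    """
    total = sum(violation_counts.values())
    if total == 0:
        return {rule: 0.0 for rule in violation_counts}
    return {rule: (count / total) ** temperature
            for rule, count in violation_counts.items()}


def proxy_reward(diagnostics, weights):
    """Scalar reward for one generated program.

    `diagnostics` is the set of rule IDs a static checker (e.g. a MISRA C
    linter) reports for the program; no unit tests are executed.
    """
    penalty = sum(weights.get(rule, 0.0) for rule in diagnostics)
    return max(0.0, 1.0 - penalty)


# Usage: 3 of 4 recent violations were misra_9_1, so it dominates.
w = rule_weights({"misra_9_1": 3, "misra_10_3": 1})
r = proxy_reward(["misra_9_1"], w)   # violating the dominant rule is costly
```

A fully compliant program (empty diagnostics) receives the maximum reward of 1.0 under this sketch, regardless of the current weights.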
Primary Area: reinforcement learning
Submission Number: 12991