Anchored Self-Play for Code Repair

Published: 05 Mar 2026, Last Modified: 05 Mar 2026. ICLR 2026 Workshop RSI Spotlight. License: CC BY 4.0
Keywords: self-play, code repair, synthetic data, reinforcement learning
Abstract: Code repair is an important capability for language models (LMs): given a buggy program and unit tests, an LM must produce a fixed program that passes the tests. Because curated code-repair data is limited, we study whether supervision can be scaled by having an LM generate bug-fix tasks with unconstrained edits, using unit tests for verification. We propose *generator-fixer self-play*, which trains a single model with reinforcement learning to alternate between generating bugs and fixing them. As the fixer improves, the generator adapts to produce more challenging bugs, yielding an automatic curriculum. However, because unit tests certify correctness but not realism, it is unclear whether training on these synthetically generated bugs improves repair on bugs encountered in practice. To measure this realism gap, we introduce BugSourceBench, which evaluates repair across diverse bug sources: human-authored bugs, errors in LM-generated code, and human edits of buggy LM-generated code. On BugSourceBench, we find that generator-fixer self-play improves repair on its self-generated bugs while degrading on human-authored bugs. We propose Anchored Self-Play (ASP), which anchors self-play to a small reference set drawn from the target bug sources. ASP (i) shapes generation with a code-embedding similarity reward and (ii) mixes reference bugs into fixer training to stabilize learning as the generator evolves. Across sources, ASP achieves the best fix rates, improving the average fix rate by $+25\%$ (relative) / $+7.2$ pp (absolute) over standard self-play, with gains on both LM-originated bugs ($+100\%$ (relative) / $+11$ pp (absolute)) and human-authored bugs ($+7.1\%$ (relative) / $+3.4$ pp (absolute)).
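The two anchoring mechanisms described in the abstract can be sketched in miniature. The sketch below is an illustration under assumptions, not the paper's implementation: the function names (`generator_reward`, `mix_fixer_batch`), the mixing weight `alpha`, the `mix_ratio`, and the use of max cosine similarity to the nearest reference embedding are all hypothetical choices standing in for the unspecified details of the similarity reward and the reference-mixing schedule.

```python
import random


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def generator_reward(fixer_solved, bug_embedding, reference_embeddings, alpha=0.5):
    """ASP generator reward (illustrative): blend difficulty with realism.

    Plain self-play rewards only difficulty (bugs the fixer fails on);
    ASP adds a code-embedding similarity term that pulls generated bugs
    toward the reference set. `alpha` is an assumed mixing weight.
    """
    difficulty = 0.0 if fixer_solved else 1.0
    realism = max(cosine_similarity(bug_embedding, r) for r in reference_embeddings)
    return (1 - alpha) * difficulty + alpha * realism


def mix_fixer_batch(self_play_bugs, reference_bugs, mix_ratio=0.25, rng=random):
    """Mix reference bugs into the fixer's training batch (illustrative ratio)."""
    n_ref = min(max(1, int(mix_ratio * len(self_play_bugs))), len(reference_bugs))
    return self_play_bugs + rng.sample(reference_bugs, n_ref)


# Example: an unsolved bug whose embedding matches a reference bug exactly
# receives the maximum blended reward.
r = generator_reward(fixer_solved=False,
                     bug_embedding=[1.0, 0.0],
                     reference_embeddings=[[1.0, 0.0], [0.0, 1.0]])
print(r)  # → 1.0
```

The blended reward means the generator cannot maximize return purely by drifting toward adversarial but unrealistic edits, while the mixed batch keeps the fixer's training distribution from tracking the generator alone.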
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 117