Keywords: world models, automated program repair, code language models, execution simulation, patch ranking, test validation, software engineering, code world model, agentic systems, LLM for code
TL;DR: We leverage Meta's Code World Model to simulate patch execution without runtime testing, achieving 78.4% Top-1 patch ranking accuracy and 74.2% cost reduction in automated program repair.
Abstract: Automated program repair faces a practical bottleneck: validating candidate
patches requires expensive test execution, yet many plausible patches prove semantically incorrect. We investigate whether world-model-trained code language
models can simulate patch application and test outcomes without runtime execution, enabling efficient patch ranking. Using Meta’s Code World Model (CWM), a 32-billion-parameter LLM mid-trained on 120 million Python execution traces
and 3 million agentic Docker trajectories, we propose WorldRepair, an agentic
framework that formulates performance bug repair as sequential decision-making.
Rather than one-shot patch generation followed by exhaustive validation, the agent
iteratively proposes and evaluates patches through simulated execution. On a
dataset of 12,847 real-world Python performance optimizations from 1,847 production repositories, WorldRepair achieves 77.8% ± 1.2% Top-1 patch ranking accuracy (mean ± std over 5 seeds) and reduces test execution costs by 72.1%
compared to exhaustive validation, while maintaining 91.7% agreement between
simulated and actual test outcomes. These results provide initial evidence that
execution trace training helps language models encode program behavior useful
for guiding automated repair, though the 4.9% false discovery rate indicates that
simulated validation should complement—not replace—selective test execution.
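As an illustration of how the agreement, false discovery, and cost figures quoted above could be computed, here is a minimal sketch. It is not taken from the paper: the `PatchOutcome` record, the function names, and the toy data are hypothetical, and the false discovery rate is assumed to follow the standard definition (the fraction of patches the simulator predicts as passing that actually fail their tests).

```python
from dataclasses import dataclass


@dataclass
class PatchOutcome:
    """Hypothetical record pairing a simulated verdict with the real one."""
    simulated_pass: bool  # verdict predicted by simulated execution
    actual_pass: bool     # verdict from actually running the tests


def agreement_rate(outcomes):
    """Fraction of patches where the simulated and actual verdicts match."""
    return sum(o.simulated_pass == o.actual_pass for o in outcomes) / len(outcomes)


def false_discovery_rate(outcomes):
    """Among patches predicted to pass, the fraction that actually fail."""
    predicted_pass = [o for o in outcomes if o.simulated_pass]
    if not predicted_pass:
        return 0.0
    return sum(not o.actual_pass for o in predicted_pass) / len(predicted_pass)


def cost_reduction(tests_run, tests_exhaustive):
    """Relative saving in test executions versus validating every patch."""
    return 1 - tests_run / tests_exhaustive


# Toy data (made up for illustration): 6 true positives, 1 false positive,
# 2 true negatives, 1 false negative.
outcomes = (
    [PatchOutcome(True, True)] * 6
    + [PatchOutcome(True, False)] * 1
    + [PatchOutcome(False, False)] * 2
    + [PatchOutcome(False, True)] * 1
)

print(agreement_rate(outcomes))
print(false_discovery_rate(outcomes))
print(cost_reduction(tests_run=25, tests_exhaustive=100))
```

Under this definition, a nonzero false discovery rate means some patches accepted by simulation alone would be wrong, which is why the abstract recommends pairing simulated validation with selective test execution.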
Submission Number: 28