World Models as Execution Simulators for Automated Program Repair

Published: 02 Mar 2026, Last Modified: 15 Apr 2026, Venue: ICLR 2026 Workshop World Models, License: CC BY 4.0
Keywords: world models, automated program repair, code language models, execution simulation, patch ranking, test validation, software engineering, code world model, agentic systems, LLM for code
TL;DR: We use Meta's Code World Model to simulate patch execution and rank candidate patches before running tests, achieving 77.8% Top-1 patch ranking accuracy and a 72.1% reduction in test execution cost in automated program repair.
Abstract: Automated program repair faces a practical bottleneck: validating candidate patches requires expensive test execution, yet many plausible patches prove semantically incorrect. We investigate whether world-model-trained code language models can simulate patch application and test outcomes without runtime execution, enabling efficient patch ranking. Using Meta’s Code World Model (CWM), a 32-billion-parameter LLM mid-trained on 120 million Python execution traces and 3 million agentic Docker trajectories, we propose WorldRepair, an agentic framework that formulates performance bug repair as sequential decision-making. Rather than generating patches in one shot and then validating them exhaustively, the agent iteratively proposes and evaluates patches through simulated execution. On a dataset of 12,847 real-world Python performance optimizations from 1,847 production repositories, WorldRepair achieves 77.8% ± 1.2% Top-1 patch ranking accuracy (mean ± std over 5 seeds) and reduces test execution costs by 72.1% compared to exhaustive validation, while maintaining 91.7% agreement between simulated and actual test outcomes. These results provide initial evidence that execution trace training helps language models encode program behavior useful for guiding automated repair, though the 4.9% false discovery rate indicates that simulated validation should complement, not replace, selective test execution.
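The idea of ranking patches by simulated outcomes and running real tests only selectively can be illustrated with a minimal Python sketch. This is not the WorldRepair implementation: the `Patch` class, the `sim_pass_prob` field, the `run_tests` callback, and the stop-at-first-pass policy are all hypothetical names and choices made for illustration. The false discovery rate below uses the standard definition (the share of simulated passes that actually fail), which may differ from the definition used in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Patch:
    patch_id: str
    # Hypothetical field: the simulator's estimated probability that tests pass.
    sim_pass_prob: float


def rank_and_validate(
    patches: List[Patch],
    run_tests: Callable[[Patch], bool],
) -> Tuple[Optional[Patch], int]:
    """Run real tests in descending simulated-score order, stopping at the first pass.

    Returns the first patch that passes (or None) and the number of real
    test executions that were needed.
    """
    ranked = sorted(patches, key=lambda p: p.sim_pass_prob, reverse=True)
    executed = 0
    for patch in ranked:
        executed += 1
        if run_tests(patch):
            return patch, executed
    return None, executed


def agreement(sim: List[bool], actual: List[bool]) -> float:
    """Fraction of patches where the simulated outcome matches the real one."""
    return sum(s == a for s, a in zip(sim, actual)) / len(sim)


def false_discovery_rate(sim: List[bool], actual: List[bool]) -> float:
    """Among patches the simulator predicts will pass, the fraction that fail."""
    actual_for_predicted_pass = [a for s, a in zip(sim, actual) if s]
    if not actual_for_predicted_pass:
        return 0.0
    failures = sum(not a for a in actual_for_predicted_pass)
    return failures / len(actual_for_predicted_pass)


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    candidates = [Patch("a", 0.2), Patch("b", 0.9), Patch("c", 0.6)]
    ground_truth = {"a": False, "b": False, "c": True}
    chosen, executed = rank_and_validate(
        candidates, lambda p: ground_truth[p.patch_id]
    )
    saved = 1 - executed / len(candidates)
    print(chosen.patch_id, executed, round(saved, 3))
```

In this toy run the top-ranked patch fails its real tests, so the second-ranked one is executed as well. Two of three candidates are tested instead of all three, which is the kind of saving that a well-calibrated simulator would be expected to enlarge.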
Submission Number: 28