Demystifying the Potential of Large Language Models as World Models of Code

Published: 23 Sept 2025, Last Modified: 22 Nov 2025 · LAW · CC BY 4.0
Keywords: LLM, World Model, Code Execution
TL;DR: We tested whether AI language models can predict code execution results without actually running the code, finding they show promise but still have significant limitations.
Abstract: A key frontier for Large Language Models (LLMs) is the development of internal world models that simulate and reason about the dynamics of an environment, which is foundational for their evolution from text generators into systems with deeper reasoning abilities. Code, with its structured logic and deterministic execution, serves as an ideal domain to study and cultivate such models. This work investigates a fundamental question: Can LLMs serve as world models of code, predicting program outcomes without actual execution? To systematically probe this potential, we introduce **WoC**, a holistic benchmark with $1160$ problems covering $8$ key aspects: multi-language programming, competition-level problems, repository-level analysis, high-cost scientific computing, time-complexity-intensive algorithms, buggy code analysis, environment-dependent programs, and formal mathematical proof verification. Through extensive analysis of $21$ open-source and proprietary LLMs, we examine scaling laws, data efficiency, and predictive accuracy. Our findings offer crucial insights into the capabilities, limitations, and scaling properties of LLMs as world models of code, a critical step towards building more general and robust computational reasoning systems.
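The abstract's core task, predicting a program's outcome without executing it, can be made concrete with a toy instance. The snippet below is an illustrative sketch, not WoC's actual data format: a model would be shown `program` as text and asked for its stdout, while a reference harness (here, `ground_truth`, a hypothetical helper) obtains the true answer by real execution for scoring.

```python
import contextlib
import io

# A toy instance of the output-prediction task: the model sees only the
# program text and must predict its stdout without running it.
program = """
total = 0
for i in range(1, 6):
    total += i * i
print(total)
"""

def ground_truth(src: str) -> str:
    """Reference answer obtained by actually executing the program."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        exec(src, {})
    return buf.getvalue().strip()

# Hypothetical model prediction: the sum of squares 1..5.
predicted = "55"
print(ground_truth(program) == predicted)
```

Scoring a prediction then reduces to string comparison against the executed result, which is the deterministic property of code that the abstract argues makes it an ideal domain for probing world models.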
Supplementary Material: zip
Submission Type: Benchmark Paper (4-9 Pages)
Submission Number: 47