Keywords: LLM Reasoning, Code Execution Reasoning
Abstract: Frontier large language models (LLMs) treat reasoning as a first-class citizen: they learn to refine their reasoning process and to try different strategies during training. Thereby, when prompted, they can think through problems and respond better with proper reasoning. For programming tasks, this makes code reasoning a must. In this paper, we propose the task of Code Execution Simulation (CES) as a proxy for evaluating the code reasoning capabilities of LLMs. CES defines the notions of valid and invalid reasoning processes, which enables it to promptly (1) determine where the execution simulation diverges from the ground truth for incorrect output predictions (essential to understanding the limitations of LLMs in code reasoning) and (2) identify suspiciously correct output predictions (essential to understanding reasoning shortcuts, hallucinations, or potential data leakage). In addition to evaluating LLMs’ execution reasoning on a program with a single test, CES measures their reasoning consistency across tests with the same or different prime path coverage. This enables it to evaluate the code reasoning of LLMs on a spectrum: strong, weak, and random. Our results show that LLMs, to a great extent (82.32%), follow a valid reasoning process (resulting in 30.79% correct and 51.53% incorrect output predictions). However, their reasoning is mostly random (55.59%) or weak (41.69%), which explains their weakness in programming tasks that require flow- or path-sensitive program analysis to succeed.
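To make the divergence check described in the abstract concrete, here is a minimal, hypothetical sketch in Python: given a ground-truth execution trace and an LLM's simulated trace (each modeled as a list of (line number, variable bindings) steps), it locates the first step where the simulation departs from the real execution. The function name `find_divergence` and the trace representation are illustrative assumptions for exposition, not the paper's actual implementation.

```python
from typing import Optional

# One trace step: (executed line number, snapshot of variable bindings).
Step = tuple[int, dict[str, object]]

def find_divergence(ground_truth: list[Step], simulated: list[Step]) -> Optional[int]:
    """Return the index of the first step where the simulated trace
    diverges from the ground-truth trace, or None if they agree."""
    for i, (gt_step, sim_step) in enumerate(zip(ground_truth, simulated)):
        if gt_step != sim_step:
            return i
    if len(ground_truth) != len(simulated):
        # One trace ends early: divergence at the shorter trace's end.
        return min(len(ground_truth), len(simulated))
    return None

# Example: the simulation mispredicts the loop variable at step 2.
gt  = [(1, {"x": 0}), (2, {"x": 1}), (2, {"x": 2})]
sim = [(1, {"x": 0}), (2, {"x": 1}), (2, {"x": 3})]
assert find_divergence(gt, sim) == 2
```

A correct final output reached through a trace that diverges mid-execution is exactly the "suspiciously correct" case the abstract flags as a sign of shortcuts, hallucination, or data leakage.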
Primary Area: causal reasoning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11384