Transforming Language Models into Program Interpreters via Execution Trace Chain of Thought

14 Sept 2025 (modified: 10 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: large language models, chain of thought, code execution
TL;DR: We introduce ET-CoT, an approach where LLMs are fine-tuned on systematic program execution traces to learn to predict code outcomes by generating these traces as a chain of thought.
Abstract: Code execution reasoning (CER), the ability to predict the result of executing code on a given input, has emerged as an important aspect of language models' (LMs) coding capabilities. However, many open-source small- to medium-sized LMs still perform poorly even on simple code snippets, and effective methodologies for enhancing CER have not yet been established. In this context, we first highlight the limitations of LMs on basic operations in CER. Through custom tests, including one that measures the understanding of individual grammar rules, we show that describing code in natural language does not imply an actual procedural understanding of it, and that reasoning steps must be accumulated in a structured manner at a granularity finer than a single line. Motivated by these insights, we investigate ET-CoT (Execution Trace Chain of Thought), a method in which execution traces are generated with our custom code interpreter PyTracify and used as chain-of-thought rationales, in order to turn 8B-class LMs into code interpreters specialized for CER. After fine-tuning on 127k examples, we demonstrate the effectiveness of ET-CoT, improving Qwen2.5-7B-Instruct to $70.0\%$ on CruxEval-O and $88.3\%$ on LiveCodeBench (execution), thereby setting new baselines for this model class.
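To make the idea of finer-than-line execution traces concrete, the sketch below shows one way such a trace could be produced and rendered as a textual rationale. This is not the authors' PyTracify interpreter, whose details are not given here; it is a hypothetical stand-in built on Python's standard `sys.settrace` hook that records each executed line together with the local variable state just before the line runs.

```python
# Illustrative sketch only: a simple tracer for generating line-level
# execution traces, in the spirit of ET-CoT. Not the authors' PyTracify.
import sys


def trace_execution(source: str):
    """Run `source` and return (steps, namespace), where each step is
    (line_no, source_line, locals_snapshot) captured just before the line runs."""
    lines = source.splitlines()
    steps = []

    def tracer(frame, event, arg):
        # Record only lines executed inside the traced snippet itself.
        if event == "line" and frame.f_code.co_filename == "<et-cot>":
            snapshot = {k: v for k, v in frame.f_locals.items()
                        if not k.startswith("__")}
            steps.append((frame.f_lineno, lines[frame.f_lineno - 1].strip(), snapshot))
        return tracer

    code = compile(source, "<et-cot>", "exec")
    namespace = {}
    sys.settrace(tracer)
    try:
        exec(code, namespace)
    finally:
        sys.settrace(None)
    return steps, namespace


def render_trace(steps) -> str:
    """Format the recorded steps as a plain-text chain-of-thought rationale."""
    out = []
    for line_no, code_line, snapshot in steps:
        state = ", ".join(f"{k}={v!r}" for k, v in snapshot.items())
        out.append(f"line {line_no}: {code_line:<20} | state: {state or '(empty)'}")
    return "\n".join(out)


if __name__ == "__main__":
    snippet = (
        "total = 0\n"
        "for i in range(3):\n"
        "    total += i * i\n"
        "result = total\n"
    )
    steps, ns = trace_execution(snippet)
    print(render_trace(steps))
    print("final result:", ns["result"])
```

Rendered this way, each trace step exposes control flow and intermediate variable values explicitly, which is the kind of structured rationale the abstract argues is needed instead of line-level or purely natural-language explanations of code.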
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 5076