Keywords: large language models, chain of thought, code execution
TL;DR: We introduce ET-CoT, an approach in which LLMs are fine-tuned on systematic program execution traces so that they learn to predict code outcomes by generating these traces as a chain of thought.
Track: Short Paper (up to 4 pages)
Abstract: Programmatic representations serve as policies, reward functions, environment models, and skill libraries for autonomous agents. However, their practical value hinges on large language models (LLMs) that can understand and reason about code, not merely generate it. A crucial aspect of this reasoning is the ability to predict the outcome of code (to ``execute'' it), a capability that remains underdeveloped. Improving it is essential for verifiable policies, self-auditing reward functions, and debuggable environment models within program-centric agents.
To address this, we propose \emph{ET-CoT (Execution Trace Chain of Thought)}, an approach in which LLMs learn to generate a detailed, systematic program execution trace as a chain of thought in order to predict program outcomes. Taking Python as an example, we designed a program-execution trace format inspired by recent theoretical advances. We then developed a new Python interpreter, \emph{PyTracify}, which outputs these traces during execution, generated a large corpus of traces with it, and fine-tuned an LLM on them. ET-CoT enables the model to execute Python programs consistently by generating the trace as a CoT. Our fine-tuned model outperforms other models of comparable size on code execution benchmarks such as CRUXEval-O and LiveCodeBench.
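To give a concrete sense of what a line-level execution trace might look like, the sketch below uses Python's built-in sys.settrace hook to print one entry per executed line with a snapshot of local variables. This is a hypothetical illustration only: the paper's actual trace format and the PyTracify interpreter are not described in this abstract, and the function names here (trace_lines, run_with_trace, example) are made up for the example.

```python
import sys

def trace_lines(frame, event, arg):
    # Emit a line-level trace entry: line number plus a snapshot of locals.
    # This format is a hypothetical stand-in, not the paper's PyTracify format.
    if event == "line":
        snapshot = dict(frame.f_locals)
        print(f"line {frame.f_lineno}: locals={snapshot}")
    return trace_lines

def run_with_trace(fn, *args):
    # Run `fn` under the tracer; the printed trace is the kind of text that
    # could serve as a chain-of-thought target for fine-tuning.
    sys.settrace(trace_lines)
    try:
        return fn(*args)
    finally:
        sys.settrace(None)

def example(n):
    total = 0
    for i in range(n):
        total += i
    return total

if __name__ == "__main__":
    print("result:", run_with_trace(example, 3))
```

Pairing such traces with the program's final output would yield (program, trace, outcome) triples of the general kind an ET-CoT-style fine-tuning setup could train on.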
Format: We have read the camera-ready instructions, and our paper is formatted with the provided template.
De-Anonymization: This submission has been de-anonymized.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 11