Keywords: Chain-of-Thought Reasoning, Code Reasoning, Execution Traces, Synthetic Data Generation, Verification, Data Quality, Code LLMs, Bi-directional Reasoning
Abstract: Teaching language models to reason about code execution remains an open problem. Current synthetic Chain-of-Thought (CoT) training data often consists of plausible-sounding explanations generated by teacher models rather than verifiable accounts of actual program behavior. As a result, models learn reasoning patterns that are syntactically fluent but logically flawed.
We address this by grounding CoT generation directly in program execution traces. Our pipeline instruments code to capture dynamic behavior, narrates the resulting execution traces into natural language, and verifies each rationale against the trace it describes. Using this pipeline, we create 54,000 execution-verified, bi-directional rationales that teach models to reason both forward (input$\rightarrow$output) and backward (output$\rightarrow$input). Models fine-tuned on our verified data achieve substantial improvements: +24.2 on LiveCodeBench-Exec, +22.3 on CruxEval-Output, and +21.1 on CruxEval-Input, demonstrating that verification quality directly determines both reasoning and code generation capabilities.
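To make the trace-grounding step concrete, here is a minimal sketch of how execution traces can be captured, narrated, and checked in Python, assuming a `sys.settrace`-based tracer; the names `capture_trace` and `narrate` are illustrative, not the paper's actual pipeline API.

```python
import sys

def capture_trace(fn, *args):
    """Record (line number, local variables) for each line fn executes."""
    events = []

    def tracer(frame, event, arg):
        # Only record line events inside the target function's code object.
        if event == "line" and frame.f_code is fn.__code__:
            events.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, events

def narrate(events):
    """Render a raw trace as simple natural-language steps."""
    return [f"line {lineno}: locals are {locs}" for lineno, locs in events]

# Forward reasoning example (input -> output): trace a toy function.
def f(x):
    y = x * 2
    return y + 1

result, events = capture_trace(f, 3)
for step in narrate(events):
    print(step)
# Verification: the narrated trace must agree with the actual output.
assert result == 7
```

A rationale generated from such a narration can then be checked step by step against the recorded locals and the final output, which is the kind of trace-level verification the abstract describes.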
Submission Type: Deployed
Copyright Form: pdf
Submission Number: 164