Code Reasoning for Software Engineering Tasks: A Survey and A Call to Action

TMLR Paper7356 Authors

05 Feb 2026 (modified: 18 Jun 2026) | Decision pending for TMLR | CC BY 4.0

Abstract: The rise of large language models (LLMs) has led to dramatic improvements across a wide range of natural language tasks. Their performance on certain tasks can be further enhanced by incorporating test-time reasoning techniques. These inference-time advances have been adopted into the code domain, enabling complex software engineering (SWE) tasks such as code generation, test generation and issue resolution. However, the impact of different reasoning techniques on code-centric SWE tasks has not been systematically explored. In this work, we survey code reasoning techniques that underpin these capabilities, with a focus on test-time compute and inference-time reasoning paradigms. We examine a variety of code-specific reasoning methods and progressively build up to SWE agents, which combine planning, tool use, and multi-step interaction. We also compare the impact of different techniques on coding tasks, highlighting their relative importance and outlining open challenges and future research directions. Across commonly used models and benchmarks, we find that approaches exploiting code-specific signals (e.g., structure and execution feedback) are frequently associated with improved performance, motivating a dedicated study of code reasoning beyond natural-language reasoning. Our contributions are: (1) to the best of our knowledge, the first dedicated survey of code reasoning for SWE tasks, highlighting overarching reasoning strategies, hybrid methods, and agentic approaches; (2) a taxonomy of inference-time techniques used to drive code reasoning, accompanied by a curated set of under-explored benchmarks with high potential for SWE evaluation; (3) a comparative analysis of reasoning design patterns across commonly used models and benchmarks; and (4) a synthesis of gaps in current methods and evaluation practices, identifying under-explored areas and concrete opportunities for future research.

Submission Type: Long submission (more than 12 pages of main content)

Previous TMLR Submission Url: https://openreview.net/forum?id=DKr2davhKA

Changes Since Last Submission:
Version 1: The previous submission was desk rejected due to adjusted font; this has since been fixed.
Version 2: The changes made according to the reviewers' suggestions are presented in red text in the updated PDF.
**Version 3 (camera-ready):** Camera-ready version addressing the Action Editor’s minor-revision requests.
- We made a final editorial pass over Sections 6–7 to better align the prose with the cautious framing used in the observation boxes. In particular, we removed remaining broad generalizations and clarified that the comparative claims refer to the benchmarks studied in the survey.
- We added more precise search dates to the survey methodology, clarifying that the initial search was conducted from February to May 2025, with a further update in December 2025. The exact query strings were already included in the previous version.
- We cleaned up the remaining PlanSearch category-label inconsistency in Table 3 and added a footnote clarifying its categorization.
- We corrected minor typos, internal section references, and figure-caption formatting issues.

Code: https://github.com/AI4Code-IBM-Columbia/code-reasoning-for-swe-tasks

Assigned Action Editor: ~quanming_yao1

Submission Number: 7356
