An Empirical Study of Reasoning Steps in Thinking Code LLMs

Published: 28 Mar 2026, Last Modified: 28 Mar 2026, AIware 2026, CC BY 4.0
Keywords: Large Language Models, Large Reasoning Models, Code Generation
Abstract: Thinking Large Language Models (LLMs) generate explicit intermediate reasoning traces before final answers, potentially improving transparency, interpretability, and solution accuracy for code generation. However, it remains unclear which trace characteristics are informative and how good the quality of these reasoning chains is. In this paper, we present an empirical study examining the reasoning processes and quality of thinking LLMs on code generation tasks. We evaluate six state-of-the-art reasoning LLMs (DeepSeek-R1, OpenAI-o3-mini, Claude-3.7-Sonnet, Gemini-2.0-Flash-Thinking, Gemini-2.5-Flash, and Qwen-QwQ) on 100 BigCodeBench code generation tasks (600 model–task instances; 3,772 reasoning steps). To characterize reasoning-chain structure, we measure step count and per-step verbosity, and compare successful versus failed attempts under difficulty stratification (Hard vs. Non-Hard). We further perform a 21-participant human evaluation of reasoning quality across three dimensions: efficiency, logical consistency, and completeness, and we build a taxonomy of problematic reasoning patterns. We find that the relationship between step count and success is model- and difficulty-dependent, and that verbosity is not a reliable correctness signal. Human analysis indicates that completeness issues dominate failures (44.5%), most often due to missed edge cases and boundary conditions, and that incompleteness is a stronger predictor of failure on Hard tasks than on Non-Hard tasks (ρ = −0.219 vs. ρ = −0.096).
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public.
Paper Type: Full-length papers (i.e. case studies, theoretical, applied research papers). 8 pages
Reroute: false
Submission Number: 33