One Step Forward, Two Steps Back: Regression Errors and Cost Inefficiencies in LLM Iterative Refinement for Code Generation

Published: 02 Mar 2026 · Last Modified: 08 Mar 2026 · ICLR 2026 Workshop ICBINB · CC BY 4.0
Keywords: Large Language Models, Code Generation, Self-Reflection, Self-Refine, Iterative Refinement, Regression Errors
TL;DR: Self-Refine not only fails to significantly improve Pass@1 scores for code generation but also frequently introduces regression errors; this instability worsens with smaller critics, which paradoxically drive up inference costs.
Abstract: Self-Refine, a self-reflection framework, has proven effective for improving Large Language Model performance on stylistic tasks, such as code readability. However, its utility for ensuring functional correctness in code generation is often undermined by significant reliability and cost trade-offs. In this work, we characterize the specific failure modes of Self-Refine in this domain and analyze whether "Asymmetric Self-Refine" (using smaller, cost-efficient models as critics) can mitigate the high inference costs of iterative refinement. Evaluating on the LiveCodeBench dataset, we uncover a critical negative result: the framework fails to reliably improve Pass@1 scores and often degrades them due to hallucination-induced regression errors. Critics hallucinate flaws in functionally correct code, prompting the actor to introduce bugs into valid solutions. This instability, present even in symmetric configurations, is exacerbated by smaller critics, which drive total inference costs higher through excessive refinement loops, contradicting the expectation of cost savings.
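The actor/critic refinement loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: `actor` and `critic` are hypothetical stand-ins for LLM calls (in the asymmetric variant, `critic` would be a smaller model), and the stopping logic simply runs until the critic accepts or an iteration budget is exhausted.

```python
from typing import Optional, Tuple

def actor(task: str, feedback: Optional[str] = None) -> str:
    """Hypothetical actor model: produce a candidate solution,
    or revise the previous one when critic feedback is given."""
    base = "def add(a, b): return a + b"
    return base if feedback is None else base + "  # revised per feedback"

def critic(task: str, solution: str) -> Optional[str]:
    """Hypothetical critic model: return feedback, or None to accept.
    The paper's negative result arises here: an over-critical (often
    smaller) critic may hallucinate flaws in correct code, forcing
    extra refinement loops and risking regression errors."""
    return None  # this toy critic accepts immediately

def self_refine(task: str, max_iters: int = 3) -> Tuple[str, int]:
    """Iterate actor -> critic until the critic accepts or the budget
    runs out. Every extra loop adds two inference calls, which is why
    a cheap-per-call but over-critical critic can raise total cost."""
    solution = actor(task)
    for i in range(max_iters):
        feedback = critic(task, solution)
        if feedback is None:  # critic accepts: stop refining
            return solution, i + 1
        solution = actor(task, feedback)  # refine using the feedback
    return solution, max_iters + 1
```

With this toy critic the loop terminates after one check; the failure mode studied in the paper corresponds to a `critic` that keeps returning (spurious) feedback, so the loop burns its full budget and may end on a worse `solution` than it started with.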
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37