Multi-Turn Code Generation Through Single-Step Rewards

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY 4.0
Abstract: We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards. We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards. Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn. $\mu$CODE iteratively trains both a generator that produces code solutions conditioned on multi-turn execution feedback and a verifier that scores the newly generated code. Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide an analysis of the design choices for the reward models and policy, and show the efficacy of $\mu$CODE at utilizing execution feedback.
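To make the training recipe in the abstract concrete, below is a minimal Python sketch of one generator/verifier training iteration. This is an illustrative outline, not the released $\mu$CODE implementation: the callables `sample_candidates`, `fit_verifier`, and `fit_generator`, and the use of a pass/fail unit-test oracle as the single-step reward, are assumptions for the sketch.

```python
from typing import Callable, List, Tuple


def mu_code_training_iteration(
    problems: List[Tuple[str, Callable[[str], bool]]],          # (problem, unit-test oracle)
    sample_candidates: Callable[[str, List[str]], List[str]],   # generator rollouts per turn
    fit_verifier: Callable[[List[Tuple[str, str, float]]], None],
    fit_generator: Callable[[List[Tuple[str, List[str], str]]], None],
    num_turns: int = 3,
) -> None:
    """One training iteration (sketch): collect multi-turn rollouts, label each
    candidate with a single-step reward (pass/fail on unit tests), fit the
    verifier on those labels, and fit the generator to imitate the best
    candidate found at each turn."""
    verifier_data: List[Tuple[str, str, float]] = []
    generator_data: List[Tuple[str, List[str], str]] = []
    for problem, passes_tests in problems:
        history: List[str] = []  # previous attempts, i.e., the multi-turn context
        for _ in range(num_turns):
            candidates = sample_candidates(problem, history)
            rewards = [1.0 if passes_tests(c) else 0.0 for c in candidates]
            verifier_data.extend((problem, c, r) for c, r in zip(candidates, rewards))
            # One-step recoverability: a correct candidate is a valid "expert"
            # action from this intermediate state, so imitate it from here.
            best = max(zip(candidates, rewards), key=lambda cr: cr[1])[0]
            generator_data.append((problem, list(history), best))
            if passes_tests(best):
                break
            history.append(best)  # continue refining with this attempt as context
    fit_verifier(verifier_data)
    fit_generator(generator_data)
```

The key design choice this sketch reflects is reducing multi-turn optimization to supervised (imitation) updates: every intermediate state can be relabeled with a correct solution reachable in one step, so no multi-turn credit assignment is needed.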
Lay Summary: We want agents that generate correct code for us, but doing so in a single try is difficult without unit-test feedback, so we focus on multi-turn code generation, where an agent iteratively refines its solution using execution feedback. However, training agents with such feedback (correct/incorrect) using reinforcement learning is challenging: the reward signal is sparse, which makes learning inefficient. Our work introduces $\mu$Code, a simple and scalable approach that makes this process more effective. First, we observe that a correct code solution can be generated at any step, meaning the agent can "recover" in a single step; we call this *one-step recoverability*. Second, instead of relying on sparse rewards, we *learn a verifier* that provides a richer score, making learning easier. These insights let us reduce the problem from complex reinforcement learning to imitation learning, which makes training more stable. In addition, the learned verifier lets us generate multiple solutions at inference time and choose the highest-scoring one in a *multi-turn Best-of-N search* (sketched below). We release our code-generation and code-verification models so that researchers can build on them in the self-improving model community. Developing stronger generators and verifiers in tandem will produce agents that are stronger and more reliable at code generation over multiple steps.
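The multi-turn Best-of-N search mentioned above can be sketched as follows. This is a minimal illustration under assumed interfaces; the helpers `generate`, `score`, and `run_public_tests` are hypothetical stand-ins, not the released API.

```python
from typing import Callable, List, Tuple


def best_of_n_multi_turn(
    problem: str,
    generate: Callable[[str, List[Tuple[str, str]]], List[str]],  # N candidates per turn
    score: Callable[[str, str], float],                           # learned verifier score
    run_public_tests: Callable[[str], Tuple[bool, str]],          # (passed, execution feedback)
    max_turns: int = 3,
) -> str:
    """At each turn, sample candidate programs conditioned on the problem and
    prior (attempt, feedback) pairs, keep the candidate the verifier scores
    highest, and stop early once it passes the public tests."""
    history: List[Tuple[str, str]] = []
    best_code = ""
    for _ in range(max_turns):
        candidates = generate(problem, history)
        best_code = max(candidates, key=lambda c: score(problem, c))
        passed, feedback = run_public_tests(best_code)
        if passed:
            break
        history.append((best_code, feedback))  # feed execution feedback into the next turn
    return best_code
```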
Link To Code: https://github.com/portal-cornell/muCode
Primary Area: Deep Learning->Large Language Models
Keywords: Code Generation, Language Models, Reinforcement Learning
Submission Number: 15056