TL;DR: We apply reinforcement learning to ground LLMs in execution feedback for effective multi-turn code generation.
Abstract: Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
Lay Summary: We observe that large language models (LLMs) struggle when deployed as agents, that is, when they perform a user-requested task autonomously and have to rely on feedback from automatic systems such as a compiler, unit tests, or website responses. Our hypothesis is that performance is low because the models cannot effectively incorporate this automatic feedback. In other words, they lack the necessary grounding. Achieving this grounding is the goal of our paper.
The preferred way to achieve the desired grounding in automatic feedback is to train an LLM in the target domain. This means that we require a learning environment that produces this automatic feedback. We select multi-turn code generation for this, which works as follows: the LLM is first presented with a coding challenge. The reply is evaluated against a small set of tests ("public tests"), and if any of these fail, corresponding feedback is provided and the LLM can propose an updated or new solution. This process is repeated several times. We then train an LLM with reinforcement learning, with the learning objective to produce correct final code as judged by a more complete set of tests ("private tests").
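To make this loop concrete, the following is a minimal Python sketch of such a multi-turn environment. It is an illustration under our own assumptions, not the paper's implementation: the names `Problem`, `run_tests`, and `generate_candidate` are hypothetical, and the reward shown is simply the pass/fail signal on the private tests described above.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Problem:
    statement: str                        # natural-language coding challenge
    public_tests: List[Tuple[str, str]]   # (stdin, expected stdout) pairs shown during the rollout
    private_tests: List[Tuple[str, str]]  # held-out tests used only for the final reward


def run_tests(code: str, tests: List[Tuple[str, str]], timeout: float = 5.0) -> List[str]:
    """Run `code` as a stdin/stdout program on each test and collect failure messages."""
    failures = []
    for stdin, expected in tests:
        try:
            proc = subprocess.run(
                [sys.executable, "-c", code],
                input=stdin, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            failures.append(f"input {stdin!r}: timed out")
            continue
        if proc.returncode != 0:
            failures.append(f"input {stdin!r}: runtime error\n{proc.stderr.strip()}")
        elif proc.stdout.strip() != expected.strip():
            failures.append(f"input {stdin!r}: expected {expected!r}, got {proc.stdout.strip()!r}")
    return failures


def rollout(problem: Problem,
            generate_candidate: Callable[[List[str]], str],
            max_turns: int = 3) -> float:
    """One multi-turn episode: the model revises its code based on public-test
    feedback; the episode reward is 1.0 only if the final program also passes
    the held-out private tests."""
    dialogue = [problem.statement]
    code = ""
    for _ in range(max_turns):
        code = generate_candidate(dialogue)               # LLM proposes (or revises) a solution
        failures = run_tests(code, problem.public_tests)
        if not failures:                                  # all public tests pass: stop early
            break
        dialogue.append(code)
        dialogue.append("Failed public tests:\n" + "\n".join(failures))
    # The reinforcement learning signal comes from the more complete private test suite.
    return 1.0 if not run_tests(code, problem.private_tests) else 0.0
```

In this sketch, `generate_candidate` stands in for the LLM conditioned on the dialogue so far; a policy-gradient method would then update the model to maximize the private-test reward returned by `rollout`.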
We show that, after reinforcement learning, models produce correct code solutions more reliably and can better incorporate feedback from unit tests. In fact, they obtain top performance relative to their size (recent, orthogonal work on reasoning with LLMs has shown significant gains on coding questions as well). Our conclusion is thus that models should be trained, with reinforcement learning, on the domains they will face in deployment.
Primary Area: Deep Learning->Large Language Models
Keywords: large language models, multi-turn code generation, reinforcement learning, LLM agents
Submission Number: 13473