TL;DR: Prompting with high-diversity examples of fixes boosts code-repair performance.
Abstract: Scaling up inference-time compute has proven to be a valuable strategy in improving the performance of Large Language Models (LLMs) without fine-tuning. An important task that can benefit from additional inference-time compute is self-repair; given an initial flawed response or guess, the LLM corrects its own mistake and produces an improved response or fix. We leverage the in-context learning ability of LLMs to perform self-repair in the coding domain. The key contribution of our paper is an approach that synthesises and selects an ordered set of golden example pairs, or AuPairs, of these initial guesses and subsequent fixes for the corresponding problems. Each such AuPair is provided as a single in-context example at inference time to generate a repaired solution. For an inference-time compute budget of $N$ LLM calls per problem, $N$ AuPairs are used to generate $N$ repaired solutions, out of which the highest-scoring solution is the final answer. The underlying intuition is that if the LLM is given a different example of fixing an incorrect guess each time, it can subsequently generate a diverse set of repaired solutions. Our algorithm selects these AuPairs in a manner that maximises complementarity and usefulness. We demonstrate the results of our algorithm on 5 LLMs across 7 competitive programming datasets for the code repair task. Our algorithm yields a significant boost in performance compared to best-of-$N$ and self-repair, and also exhibits strong generalisation across datasets and models. Moreover, our approach shows stronger scaling with inference-time compute budget compared to baselines.
Lay Summary: Large Language Models (LLMs) have become steadily better at writing code over the past few years, but they still often make mistakes. Our aim is to make LLMs better at fixing their own coding errors: not just by passing back the first attempt and asking for an improvement, but by providing a diverse range of example corrections, much as a human might consider multiple ways to fix an issue.
To address this, we developed an algorithm called AuPair, which curates an ordered set of golden example pairs. An example pair contains an initial flawed piece of code along with an improved version. Our algorithm yields golden example pairs that are complementary and diverse, thus boosting performance. During inference, a different golden example pair is given in each LLM call to fix a problem's solution, exposing the LLM to different types of fixes, thereby guiding it towards a better solution.
Our work shows that selecting good in-context examples, or more generally, carefully curated in-context data, can serve as a powerful inference technique that outperforms inference-time algorithms like best-of-$N$ and self-repair. While we have focused on code repair, AuPair is a general approach that is compatible with a wide range of domains and tasks.
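The inference-time procedure described above can be sketched in a few lines. This is an illustrative outline, not the authors' implementation: `llm_repair` and `score` are hypothetical stand-ins for an LLM call (with one AuPair in context) and a scorer such as a unit-test pass rate.

```python
def aupair_inference(problem, initial_guess, aupairs, llm_repair, score):
    """Generate one repair per in-context AuPair; return the best-scoring fix.

    aupairs: ordered list of (flawed_code, fixed_code) example pairs.
    llm_repair(pair, problem, guess): one LLM call conditioned on a single
        AuPair as the in-context example; returns a candidate repaired solution.
    score(problem, solution): higher is better (e.g. fraction of tests passed).
    """
    best_fix, best_score = initial_guess, score(problem, initial_guess)
    # N AuPairs -> N LLM calls -> N candidate repairs; keep the best one.
    for pair in aupairs:
        candidate = llm_repair(pair, problem, initial_guess)
        s = score(problem, candidate)
        if s > best_score:
            best_fix, best_score = candidate, s
    return best_fix
```

Because each call sees a different example of a fix, the N candidates tend to be diverse, which is what makes picking the best of them effective.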
Primary Area: Deep Learning->Large Language Models
Keywords: LLM, Coding, In-context learning
Submission Number: 5222