NextCoder: Robust Adaptation of Code LMs to Diverse Code Edits

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper introduces a synthetic data generation pipeline and a robust model adaptation algorithm to train models for diverse code-editing tasks without losing their original code-generation abilities.
Abstract: Software engineering activities frequently involve edits to existing code. However, contemporary code language models (LMs) lack the ability to handle diverse types of code-edit requirements. In this work, we attempt to overcome this shortcoming through (1) a novel synthetic data generation pipeline and (2) a robust model adaptation algorithm. Starting with seed code examples and diverse editing criteria, our pipeline generates high-quality samples comprising original and modified code, along with natural language instructions in different styles and verbosity. Today's code LMs come bundled with strong abilities, such as code generation and instruction following, which should not be lost due to fine-tuning. To ensure this, we propose a novel adaptation algorithm, SeleKT, that (a) leverages a dense gradient-based step to identify the weights that are most important for code editing, and (b) performs a sparse projection onto the base model to avoid overfitting. Using our approach, we obtain a new series of models, NextCoder (adapted from QwenCoder-2.5), that achieves strong results on five code-editing benchmarks, outperforming comparable-size models and even several larger ones. We show the generality of our approach on two model families (DeepSeekCoder and QwenCoder), compare against other fine-tuning approaches, and demonstrate robustness by showing retention of code generation and general problem-solving abilities post adaptation. We open-source the models, synthetic dataset, and implementation at http://aka.ms/nextcoder.
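The abstract's description of SeleKT can be illustrated with a small sketch: after a dense fine-tuning step, only the largest-magnitude weight changes relative to the base model are kept, and all other weights are reset to their base values. This is a hypothetical NumPy illustration of the sparse-projection idea only (the function name `selekt_project`, the flat-vector representation, and the top-k magnitude criterion are assumptions for illustration; the paper's actual algorithm and schedule may differ):

```python
import numpy as np

def selekt_project(theta_base: np.ndarray, theta_finetuned: np.ndarray, k: int) -> np.ndarray:
    """Keep only the k largest-magnitude weight changes from fine-tuning;
    reset every other weight to its base-model value (a sparse projection
    onto the base model, as sketched in the abstract)."""
    delta = theta_finetuned - theta_base
    # Indices of the k entries with the largest |delta| -- treated here
    # as the weights "most important" for the code-editing task.
    keep = np.argsort(np.abs(delta))[-k:]
    mask = np.zeros(delta.shape, dtype=bool)
    mask[keep] = True
    # Retained weights stay fine-tuned; all others revert to the base model.
    return np.where(mask, theta_finetuned, theta_base)

# Toy example: only the two largest updates survive the projection.
base = np.zeros(5)
tuned = np.array([0.1, -2.0, 0.05, 3.0, 0.2])
projected = selekt_project(base, tuned, k=2)
# projected == [0.0, -2.0, 0.0, 3.0, 0.0]
```

The intuition is that a small set of targeted weight changes suffices for the editing skill, while reverting the rest keeps the base model's code-generation and instruction-following abilities intact.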
Lay Summary: We often need to update or change computer programs by editing their code, but today’s AI code assistants struggle to handle the wide variety of edits people want. In this work, we created two key innovations to help: a method for generating lots of realistic code-edit examples and a smart way to train models on them without losing their existing abilities. First, we built a pipeline that takes sample code (written in different styles) to generate many high-quality code-edit examples. Then, we designed an approach called SeleKT that carefully adjusts the model’s weights, focusing only on the parts most useful for editing while keeping its general programming skills intact. Our adapted models, called NextCoder, performed impressively on multiple code-editing tests, even beating larger models. We’re sharing the models, dataset, and tools at http://aka.ms/nextcoder so others can explore and use them for their tasks.
Link To Code: http://aka.ms/nextcoder
Primary Area: Deep Learning->Large Language Models
Keywords: Code-LMs, code-editing, code-generation, software engineering
Submission Number: 15174