Submission Type: Regular Long Paper
Submission Track: NLP Applications
Keywords: Code Edit, Instruction Finetuning
TL;DR: InstructCoder is the first dataset designed to adapt LLMs for general code editing.
Abstract: Code editing encompasses a variety of practical tasks that developers deal with daily. Despite its relevance and practical usefulness, automatic code editing remains underexplored in deep learning research, partly due to data scarcity. In this work, we explore the use of large language models (LLMs) to edit code based on user instructions, covering a broad range of implicit tasks such as comment insertion, code optimization, and code refactoring. To facilitate this, we introduce CodeInstruct, the first dataset designed to adapt LLMs for general-purpose code editing. It consists of over 114,000 instruction-input-output triplets and covers multiple distinct, highly diverse code-editing scenarios. The dataset is expanded systematically through an iterative process that starts from seed tasks built from code-editing data in GitHub commits; the seed tasks and previously generated tasks are then used to prompt ChatGPT for more task data. Our experiments demonstrate that open-source LLMs fine-tuned on CodeInstruct can edit code correctly based on users' instructions most of the time, exhibiting unprecedented code-editing performance. These results suggest that effective instruction finetuning can lead to significant improvements in code-editing abilities.
Submission Number: 261