Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

ICLR 2025 Conference Submission 560 Authors

13 Sept 2024 (modified: 24 Nov 2024), ICLR 2025 Conference Submission, CC BY 4.0
Keywords: language model, code synthesis, reasoning, synthetic data
Abstract: Software engineers mainly write code by editing existing programs. In contrast, language models (LMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of sequential edit data: high-quality instruction data for code synthesis is already scarce, and sequential edit data is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors existing code into sequences of structured code edits by using a linter to procedurally sample across the syntactically interdependent parts of a program. It outputs sampled edit sequences as text strings consisting of consecutive program diffs. To test LintSeq, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we instruction-finetune a series of smaller LMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset. We perform comprehensive evaluations comparing LintSeq-finetuned models against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models finetuned on edit sequences match or outperform baselines on pass@1 and exhibit better scaling across higher pass@k as a function of total test-time compute. Finally, we also pretrain our own tiny LMs for code understanding. We show that finetuning these models on LintSeq data results in strong performance on HumanEval and MBPP(+) compared to existing code LMs of comparable size.
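The abstract describes turning a finished program into a sequence of consecutive diffs, with a linter guiding which parts are sampled together. The abstract does not spell out the procedure, so the following is only an illustrative sketch and not the authors' implementation. It assumes a simplified scheme: repeatedly delete a random chunk of lines, keep the result only if it still parses (Python's `ast.parse` stands in for a real linter), then reverse the states and diff each consecutive pair. The names `is_clean` and `sample_edit_sequence` and the `max_tries` parameter are hypothetical.

```python
import ast
import difflib
import random


def is_clean(lines):
    """Stand-in for a linter (assumption): accept only programs that parse."""
    try:
        ast.parse("\n".join(lines))
        return True
    except SyntaxError:
        return False


def sample_edit_sequence(program, rng, max_tries=50):
    """Return (states, diffs) leading from an empty program to `program`.

    Illustrative only: works backward by deleting random contiguous chunks,
    rejecting deletions that leave a program that no longer parses.
    """
    states = [program.splitlines()]
    while states[-1]:
        current = states[-1]
        for _ in range(max_tries):
            start = rng.randrange(len(current))
            end = rng.randrange(start, len(current)) + 1  # delete >= 1 line
            candidate = current[:start] + current[end:]
            if is_clean(candidate):
                break
        else:
            candidate = []  # fallback: remove everything that is left
        states.append(candidate)

    states.reverse()  # now ordered from empty program to full program
    diffs = [
        "\n".join(difflib.unified_diff(before, after, lineterm="", n=0))
        for before, after in zip(states, states[1:])
    ]
    return states, diffs


example = "def add(a, b):\n    total = a + b\n    return total\n\nprint(add(1, 2))"
example_states, example_diffs = sample_edit_sequence(example, random.Random(0))
print("\n\n".join(example_diffs))
```

Each printed diff is one edit; concatenating them gives a text string of the kind a model could be trained to emit in place of the full program. Because every intermediate state is required to parse, lines that depend on each other syntactically (a `def` header and its body, for instance) tend to be inserted in a valid order.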
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 560