D3: A Large Dataset for Training Code Language Models to Act Diff-by-Diff

Published: 06 Mar 2025, Last Modified: 02 Apr 2025 · ICLR 2025 Workshop Data Problems Poster · CC BY 4.0
Keywords: language models, data filtering, synthetic data, code synthesis
Abstract: We introduce D3 ("Diverse Data for Diff-by-Diff Coding"), a large dataset for training LMs to iteratively synthesize general-purpose Python source code by generating file diffs. D3 frames code synthesis as a goal-conditioned sequential decision-making problem, where goals, states, and actions are represented by token sequences corresponding to a description of the functionality to add, the current contents of a file, and a file diff, respectively. The dataset contains 8 billion tokens of instruction, file-state, and file-diff-sequence examples sampled from 850,000 human-written Python source files. To construct D3, we filter, augment, and annotate source code from The Stack by sampling synthetic file-diff sequences with a code analysis tool and labeling each sample with an LLM-generated rationale. In our experiments, we show that mid-training LMs such as Llama 3.2 1B and 3B on D3 prior to supervised fine-tuning (SFT) on task-curated data improves performance on synthesis and editing tasks. On benchmarks such as HumanEvalSynth and HumanEvalFix, we observe improvements in pass@1 of 3 to 6 points compared to direct SFT. D3-trained models are particularly strong at completing partial human-written solutions to programming problems.
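To make the goal/state/action framing concrete, the sketch below shows one way a single D3-style training step could be represented and serialized into a token sequence for language-model training. The field names, delimiter tags, and example filename are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass


# Hypothetical schema: the abstract describes goals, states, and actions as
# token sequences (instruction, current file contents, file diff), plus an
# LLM-generated rationale per sample. Field names here are assumptions.
@dataclass
class DiffStep:
    instruction: str  # goal: description of the functionality to add
    file_state: str   # state: current contents of the Python file
    file_diff: str    # action: a diff that implements the instruction
    rationale: str    # LLM-generated explanation attached to the sample


def to_training_text(step: DiffStep) -> str:
    """Serialize one diff step into a single training sequence.

    The delimiter tags below are illustrative, not the ones used by D3.
    """
    return (
        f"<instruction>\n{step.instruction}\n"
        f"<file>\n{step.file_state}\n"
        f"<rationale>\n{step.rationale}\n"
        f"<diff>\n{step.file_diff}\n"
    )


if __name__ == "__main__":
    step = DiffStep(
        instruction="Add a helper that squares a number.",
        file_state="def cube(x):\n    return x ** 3\n",
        file_diff=(
            "--- a/math_utils.py\n"
            "+++ b/math_utils.py\n"
            "@@ -1,2 +1,5 @@\n"
            " def cube(x):\n"
            "     return x ** 3\n"
            "+\n"
            "+def square(x):\n"
            "+    return x ** 2\n"
        ),
        rationale="A square helper complements the existing cube helper.",
    )
    print(to_training_text(step))
```

In this framing, a multi-step example is simply a sequence of such steps, with each diff applied to produce the next file state.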
Submission Number: 59
