Keywords: patch generation, code diffs, diff understanding, dataset, evaluation
Abstract: Reliable handling of code diffs is central to agents that edit and refactor repositories at scale.
We introduce Diff-XYZ, a compact benchmark for code–diff understanding with three supervised tasks: (i) apply (old code + diff $\rightarrow$ new code); (ii) anti-apply (new code – diff $\rightarrow$ old code); (iii) diff generation (new code – old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol.
We use the benchmark to conduct a focused empirical study of the unified diff format and a cross-format comparison of alternative diff representations. Our findings show that the best-performing format depends on the use case and model size.
For example, the search-replace format works well for larger models in the diff generation setting, yet is poorly suited to diff analysis tasks and to smaller models. Diff-XYZ is a reusable foundation for assessing and improving diff handling in LLMs and can inform the future development of diff formats and code-editing models.
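The sketch below illustrates how a benchmark instance and the three tasks relate; it is a minimal illustration under our own assumptions, not the paper's exact protocol. The example snippet, file names, and the `exact_match` helper are hypothetical; only the triple structure and task directions come from the abstract.

```python
import difflib

# A benchmark instance is a triple <old_code, new_code, diff>.
# This toy pair is illustrative; real instances are drawn from CommitPackFT commits.
old_code = "def greet(name):\n    print('Hello, ' + name)\n"
new_code = "def greet(name: str) -> None:\n    print(f'Hello, {name}')\n"

# Task (iii), diff generation: old code + new code -> diff (unified format here).
diff = "".join(
    difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile="a/greet.py",
        tofile="b/greet.py",
    )
)
print(diff)

# Task (i), apply: old code + diff -> new code.
# Task (ii), anti-apply: new code + diff -> old code.
# In both cases a model's output is scored against the reference side of the
# triple with automatic metrics; exact match is one such metric (assumption).
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()
```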
Supplementary Material: zip
Submission Number: 49