Keywords: patch generation, code diffs, diff understanding, dataset, evaluation
Abstract: Reliable handling of code diffs is central to agents that edit and refactor repositories at scale.
We introduce Diff-XYZ, a compact benchmark for code–diff understanding with three supervised tasks: (i) apply (old code + diff $\rightarrow$ new code); (ii) anti-apply (new code – diff $\rightarrow$ old code); (iii) diff generation (new code – old code $\rightarrow$ diff). Instances in the benchmark are triples $\langle \textit{old code}, \textit{new code}, \textit{diff} \rangle$ drawn from real commits in CommitPackFT, paired with automatic metrics and a clear evaluation protocol.
We use the benchmark to conduct a focused empirical study of the unified diff format and a cross-format comparison of alternative diff representations. Our findings show that the best-performing format depends on the use case and model size.
For example, the search-replace format works well for larger models in the diff generation setting, yet is poorly suited to diff analysis tasks and to smaller models. Diff-XYZ is a reusable foundation for assessing and improving diff handling in LLMs and can inform the future development of diff formats and code-editing models.
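The sketch below illustrates how a benchmark instance and the three tasks relate; it is a minimal illustration under our own assumptions, not the paper's exact protocol. The example snippet, file names, and the `exact_match` helper are hypothetical; only the triple structure and task directions come from the abstract.

```python
import difflib

# A benchmark instance is a triple <old_code, new_code, diff>.
# This toy pair is illustrative; real instances are drawn from CommitPackFT commits.
old_code = "def greet(name):\n    print('Hello, ' + name)\n"
new_code = "def greet(name: str) -> None:\n    print(f'Hello, {name}')\n"

# Task (iii), diff generation: old code + new code -> diff (unified format here).
diff = "".join(
    difflib.unified_diff(
        old_code.splitlines(keepends=True),
        new_code.splitlines(keepends=True),
        fromfile="a/greet.py",
        tofile="b/greet.py",
    )
)
print(diff)

# Task (i), apply: old code + diff -> new code.
# Task (ii), anti-apply: new code + diff -> old code.
# In both cases a model's output is scored against the reference side of the
# triple with automatic metrics; exact match is one such metric (assumption).
def exact_match(prediction: str, reference: str) -> bool:
    return prediction.strip() == reference.strip()
```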
Supplementary Material: zip
Submission Number: 49