BenchName: a Set of Benchmarks for Long-Context Code Models

ICLR 2026 Conference Submission 13437 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: code generation, code completion, code summarization, bug localization, diff summarization, evaluation, datasets
TL;DR: A collection of benchmarks for code processing models that work with project-level context
Abstract: The fields of code and natural language processing are evolving rapidly, with models becoming better at processing long context windows: supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of comprehensive benchmarks for code processing that go beyond a single file of context, and the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing BenchName, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI build repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page with the leaderboard on HuggingFace Spaces; links to the HuggingFace Hub for all the datasets and to the GitHub repository with the baselines are available in the manuscript.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13437