Keywords: Large language models (LLMs), Benchmarking, Long-context, Coding
TL;DR: LongCodeBench is a benchmark that evaluates long-context language models on real-world coding tasks (code comprehension and repair) at context lengths up to one million tokens.
Abstract: Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years.
The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only because of the cost of collecting million-token tasks, but also because of the difficulty of identifying realistic scenarios that genuinely require such long contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce **LongCodeBench** (**LCB**), a benchmark to test LLM coding abilities in long-context scenarios.
Our benchmark tests both the comprehension and repair capabilities of long-context language models (LCLMs) in realistic and important settings, drawing on real-world GitHub issues to construct question-answering (**LongCodeQA**) and bug-fixing (**LongSWE-Bench**) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long context remains a weakness for all models, with performance dropping from 29% to 3% for Claude 3.5 Sonnet and from 70.2% to 40% for Qwen2.5.
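To make the task construction concrete, below is a minimal sketch of how a LongCodeQA-style example might be assembled: repository files are concatenated up to a target context budget and paired with a question derived from a GitHub issue. This is an illustrative assumption rather than the paper's actual pipeline; the function names, the character-based token estimate, and the bucket sizes are all hypothetical.

```python
# Illustrative sketch (not the paper's pipeline): pack repository files into a
# single context of a target token budget and pair it with an issue-derived question.
from dataclasses import dataclass

# Hypothetical context-length buckets used to stratify difficulty (in tokens).
BUCKETS = [32_000, 128_000, 512_000, 1_000_000]

@dataclass
class LongCodeQAExample:
    context: str   # concatenated repository source files
    question: str  # question derived from a GitHub issue thread
    answer: str    # reference answer used for grading

def approx_tokens(text: str) -> int:
    # Rough estimate (~4 characters per token); a real pipeline would use
    # the evaluated model's own tokenizer.
    return len(text) // 4

def build_example(repo_files: dict[str, str], question: str, answer: str,
                  budget: int) -> LongCodeQAExample:
    """Concatenate files until the token budget is reached."""
    parts, used = [], 0
    for path, source in repo_files.items():
        chunk = f"# FILE: {path}\n{source}\n"
        cost = approx_tokens(chunk)
        if used + cost > budget:
            break
        parts.append(chunk)
        used += cost
    return LongCodeQAExample(context="".join(parts), question=question, answer=answer)
```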
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 474