Keywords: LLMs, coding benchmarks, evaluation, inverse problems, long-horizon reasoning, systems-level comprehension, robustness evaluation, code corruption, software engineering
TL;DR: We introduce Breakpoint, a method for generating difficult coding tasks at large scale that stress-test models' system-level reasoning.
Abstract: Benchmarks for large language models (LLMs) have predominantly assessed short-horizon, localized reasoning.
Existing long-horizon suites (e.g., SWE-Lancer) rely on manually curated issues, so expanding or tuning difficulty demands expensive human effort, and evaluations quickly saturate.
However, many real-world tasks, such as software engineering and scientific research, require agents to rapidly comprehend and manipulate novel, complex structures; evaluating these capabilities requires the ability to construct large and varied sets of problems for agents to solve.
We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions within real-world software repositories. Breakpoint systematically controls task difficulty along two dimensions: local reasoning (characterized by code complexity metrics such as cyclomatic complexity) and system-level reasoning (characterized by call-graph centrality and the number of simultaneously corrupted interdependent functions).
In experiments across more than 900 generated tasks, we demonstrate that Breakpoint's methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest. We analyze how static parameters control task difficulty, characterize how improvements in models and inference-time budgets affect local versus system-level reasoning, and evaluate the strategies models use to gather information and iterate on solutions, demonstrating Breakpoint's effectiveness as a comprehensive evaluation suite for understanding agent behavior and capabilities.
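To make the task construction concrete, below is a minimal, hypothetical sketch of the two difficulty signals the abstract names (cyclomatic complexity for local reasoning, call-graph centrality for system-level reasoning) together with a naive corruption that stubs out a target function. It uses only the Python standard library plus networkx; the function names and the corruption strategy are illustrative assumptions, not the paper's actual implementation (Breakpoint corrupts functions adversarially rather than simply deleting their bodies).

```python
# Hypothetical sketch of Breakpoint-style difficulty signals and a naive
# corruption. Illustrative only; not the paper's code.
import ast
import textwrap

import networkx as nx


def cyclomatic_complexity(func: ast.FunctionDef) -> int:
    """Rough cyclomatic complexity: 1 + number of branching constructs."""
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(node, branches) for node in ast.walk(func))


def call_graph(tree: ast.Module) -> nx.DiGraph:
    """Edges from each module-level function to the functions it calls."""
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    graph = nx.DiGraph()
    graph.add_nodes_from(funcs)
    for name, func in funcs.items():
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Name)
                    and node.func.id in funcs):
                graph.add_edge(name, node.func.id)
    return graph


def corrupt(source: str, target: str) -> str:
    """Naive corruption: replace one function body with a raise stub."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == target:
            node.body = [ast.Raise(
                exc=ast.Call(
                    func=ast.Name(id="NotImplementedError", ctx=ast.Load()),
                    args=[], keywords=[]),
                cause=None)]
    return ast.unparse(tree)


source = textwrap.dedent("""
    def helper(x):
        return x * 2 if x > 0 else -x

    def main(xs):
        return [helper(x) for x in xs]
""")

tree = ast.parse(source)
centrality = nx.degree_centrality(call_graph(tree))
for func in (n for n in tree.body if isinstance(n, ast.FunctionDef)):
    print(func.name,
          "complexity:", cyclomatic_complexity(func),
          "centrality:", round(centrality[func.name], 2))
print(corrupt(source, "helper"))
```

Under this sketch, a repair task would hand an agent the corrupted repository plus its failing tests, and difficulty could be tuned by selecting targets with higher complexity or centrality, or by corrupting several interdependent functions at once.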
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1636