Keywords: Benchmark, Binary analysis
Abstract: Most existing benchmarks evaluate LLM agents only by whether they succeed or fail, and so cannot show how reasoning breaks down over many steps. We introduce BIN-BENCH, a benchmark of 520 real-world binary files in which agents must reason through dozens of steps and early mistakes propagate into later failures that are hard to recover from. We propose metrics that analyze complete reasoning traces to reveal how agents explore binaries, retain information, and where their reasoning goes wrong. Our evaluation shows that LLM agents struggle to maintain an accurate understanding over long reasoning sequences and fail in ways that success/failure metrics cannot reveal. We find that agents face a fundamental tradeoff: consistency in reasoning makes it harder to recover from early mistakes, while the flexibility to correct errors makes it harder to maintain a coherent understanding. Adding planning, reflection, or error-correction mechanisms helps only slightly. Our findings suggest that improving long-horizon reasoning requires better LLM capabilities and context management, not just more sophisticated control strategies. An anonymized version of the benchmark and related artifacts is available at https://anonymous.4open.science/r/anonymous-A-4BDE/.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: agent evaluation, long-horizon reasoning, benchmarks
Contribution Types: Data resources, Data analysis
Languages Studied: Binary Code
Submission Number: 9105