Keywords: Benchmark, Binary analysis
Abstract: Most existing benchmarks evaluate LLM agents only by whether they succeed or fail, and so cannot show how reasoning breaks down over many steps. We introduce BIN-BENCH, a benchmark of 520 real-world binary files in which agents must reason through dozens of steps and early mistakes propagate into later failures that are hard to recover from. We propose metrics that analyze complete reasoning traces to reveal how agents explore binaries, retain information, and where their reasoning goes wrong. Our evaluation shows that LLM agents struggle to maintain an accurate understanding over long reasoning sequences and fail in ways that success/failure metrics cannot reveal. We find that agents face a fundamental tradeoff: consistency in reasoning makes it harder to recover from early mistakes, while the flexibility to correct errors makes it harder to maintain a coherent understanding. Adding planning, reflection, or error-correction mechanisms helps only slightly. Our findings suggest that improving long-horizon reasoning requires better LLM capabilities and context management, not just more sophisticated control strategies. An anonymized version of the benchmark and related artifacts is available at https://anonymous.4open.science/r/anonymous-A-4BDE/.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: agent evaluation, long-horizon reasoning, benchmarks
Contribution Types: Data resources, Data analysis
Languages Studied: Binary Code
Submission Number: 9105