FaultLoc: Evaluating Coding Agents For Fault Localization
Keywords: llm, agent, cybersecurity, evaluation, coding
TL;DR: We propose FAULTLOC, a large-scale security benchmark that evaluates coding agents on fault localization.
Abstract: Coding agents powered by large language models have demonstrated strong performance on security tasks such as vulnerability discovery and automated patching. However, existing benchmarks primarily evaluate patch generation through functional tests, which lets agents succeed by mitigating symptoms at the crash site without identifying the underlying root cause. As a result, it remains unclear whether these systems perform genuine code reasoning or rely on superficial fixes. We introduce FAULTLOC, a benchmark designed to evaluate fault localization in large codebases. Given a bug report and a proof-of-concept (PoC) input, agents must identify the code region responsible for the vulnerability, and their predictions are evaluated against developer patches. FAULTLOC comprises 900 real-world vulnerabilities with verified PoCs and fixes across widely used open-source projects. Our evaluation shows that even frontier agents fail to recover the developer-identified root-cause region in most cases, achieving only 40% overlap with ground-truth buggy lines and frequently mislocalizing vulnerabilities at the file level. These results highlight the need for agents that reason about causal structure rather than merely suppress observable failures.
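The abstract does not spell out how overlap with ground-truth buggy lines is computed. As a rough illustration only, the sketch below assumes a recall-style score (the fraction of developer-patched lines that the agent's predicted region covers) and a simple file-level hit check. The `Region` type, the function names, and the example paths are hypothetical and are not taken from the benchmark.

```python
from typing import Dict, Set

# Hypothetical representation: file path -> set of 1-indexed line numbers.
Region = Dict[str, Set[int]]


def line_overlap(predicted: Region, ground_truth: Region) -> float:
    """Fraction of ground-truth buggy lines covered by the prediction.

    Assumption: a recall-style overlap; the benchmark may define it differently.
    """
    total = sum(len(lines) for lines in ground_truth.values())
    if total == 0:
        return 0.0
    covered = sum(
        len(lines & predicted.get(path, set()))
        for path, lines in ground_truth.items()
    )
    return covered / total


def file_level_hit(predicted: Region, ground_truth: Region) -> bool:
    """True if at least one predicted file is also touched by the developer patch."""
    return bool(set(predicted) & set(ground_truth))


# Illustrative data only: the developer patch touches four lines in one file.
truth: Region = {"src/parser.c": {120, 121, 122, 123}}
# The agent flags three lines in the right file plus one line in an unrelated file.
pred: Region = {"src/parser.c": {122, 123, 124}, "src/util.c": {10}}

overlap = line_overlap(pred, truth)
hit = file_level_hit(pred, truth)
```

Under these assumptions, a prediction that only flags the crash site in a different file from the patch would score zero on both measures, which is the failure mode the abstract describes as mislocalizing at the file level.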
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 248