Keywords: agent, code reasoning, code auditing, bug detection
TL;DR: An alignment-based approach that empowers LLMs to detect complex software bugs through structured reasoning.
Abstract: Software auditing is an increasingly critical task in the era of rapid code generation. While LLM-based auditors have demonstrated strong potential, their effectiveness remains limited by misalignment with the highly complex, domain-specific nature of bug detection. In this work, we introduce BugScope, an inference-time alignment framework inspired by human auditing practices. BugScope structures auditing into three steps (seed identification, context retrieval, and bug detection) and aligns LLMs to each step by analyzing real bug reports, generating diverse examples, and distilling concise, reusable guidelines. On a curated dataset of 40 real-world bugs from 21 widely used open-source projects,
BugScope achieves 86.05% precision and 87.88% recall, corresponding to an F1 score of 0.87. By comparison, leading industrial tools such as Cursor BugBot and CodeRabbit achieve F1 scores of only 0.43 and 0.29, respectively. Beyond benchmarks, large-scale evaluation on real-world projects such as the Linux kernel uncovered 184 previously unknown bugs, of which 78 have already been fixed and 7 explicitly confirmed by developers.
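The three-step structure described above can be sketched as a simple pipeline. This is a minimal, illustrative sketch only: all names (`Seed`, `identify_seeds`, `retrieve_context`, `detect_bug`, `audit`) and the toy heuristics standing in for the aligned LLM calls are assumptions, not BugScope's actual API or logic.

```python
from dataclasses import dataclass

@dataclass
class Seed:
    """A code location suspected to be bug-relevant (step 1). Hypothetical type."""
    file: str
    line: int
    reason: str

def identify_seeds(diff: str) -> list[Seed]:
    # Step 1, seed identification: flag suspicious statements. A toy
    # heuristic treats unchecked pointer dereferences as seeds; the real
    # system would use an LLM aligned with guidelines from bug reports.
    seeds = []
    for i, ln in enumerate(diff.splitlines(), 1):
        if "->" in ln and "if" not in ln:
            seeds.append(Seed("patch.c", i, "unchecked dereference"))
    return seeds

def retrieve_context(seed: Seed, codebase: dict[str, str]) -> str:
    # Step 2, context retrieval: gather the surrounding code needed to
    # judge the seed (here, just the whole file for simplicity).
    return codebase.get(seed.file, "")

def detect_bug(seed: Seed, context: str) -> bool:
    # Step 3, bug detection: the aligned LLM would decide here; the toy
    # stand-in reports a bug if the context never null-checks the value.
    return "== NULL" not in context

def audit(diff: str, codebase: dict[str, str]) -> list[Seed]:
    # Chain the three steps over every candidate seed.
    return [s for s in identify_seeds(diff)
            if detect_bug(s, retrieve_context(s, codebase))]
```

For example, `audit("p->next = q;", {"patch.c": "p->next = q;"})` flags one seed, while the same diff audited against a file that begins with `if (p == NULL) return;` reports none.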
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Submission Number: 9750