Automated Judging of LLM-based Smart Contract Security Auditors

Published: 01 Jan 2025, Last Modified: 27 Sept 2025 · ICBC 2025 · CC BY-SA 4.0
Abstract: Large Language Model (LLM)-based smart contract auditors, which increasingly supplement manual audits, lack standardized evaluation methods. This creates considerable uncertainty about their reliability and effectiveness in identifying vulnerabilities. The gap is particularly critical as smart contract exploits continue to cause billion-dollar losses and the complexity of blockchain systems grows rapidly. Our research introduces smartJudge, a comprehensive evaluation framework that systematically assesses the capabilities of LLM-based smart contract auditors through multi-dimensional analysis. To develop smartJudge, we first created an agenticRAG architecture that uses agents to acquire in-depth knowledge and build expertise on smart contract auditing, vulnerabilities, and best practices. We then developed a distilled LLM judge model that efficiently processes auditor outputs at strategic evaluation points to assess how well auditors detect vulnerability patterns, understand security research, and identify complex vulnerability chains. Finally, we constructed a benchmark suite with evaluation metrics that provides a standardized way to measure auditor performance across diverse vulnerability types. Testing smartJudge on representative LLM-based auditors revealed critical gaps in vulnerability detection, particularly for complex attack vectors and novel exploit patterns. Our framework provides insights for improving automated auditing tools and establishes the first systematic methodology for evaluating LLM-based smart contract auditors, addressing a crucial bottleneck in blockchain security tooling.