Keywords: Large Language Models, Benchmark, Smart Contract Security, Data Contamination, Evaluation
Abstract: Frontier large language models achieve state-of-the-art performance on code understanding benchmarks, yet their true capacity for smart contract security reasoning remains unclear. Can they genuinely reason about vulnerabilities, or do they merely pattern-match against memorized exploits? We introduce BlockBench, a contamination-controlled benchmark revealing that best-case detection accuracy (86.5%) degrades sharply to just 25.3% on uncontaminated samples, suggesting substantial reliance on memorized surface patterns rather than genuine vulnerability reasoning.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10562