Keywords: Large Language Models, Benchmark, Smart Contract Security, Data Contamination, Evaluation
Abstract: Frontier large language models achieve state-of-the-art performance on code understanding benchmarks, yet their true capacity for smart contract security reasoning remains unclear. Can they genuinely reason about vulnerabilities, or do they merely pattern-match against memorized exploits? We introduce BlockBench, a contamination-controlled benchmark revealing that best-case detection accuracy (86.5%) degrades sharply to just 25.3% on uncontaminated samples, suggesting substantial reliance on memorized surface patterns rather than genuine vulnerability reasoning.
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Resources and Evaluation
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 10562