SPECA: Specification-to-Checklist Agentic Auditing for Multi-Implementation Systems — A Case Study on Ethereum Clients
Keywords: LLM agent, multi-implementation security
TL;DR: We present SPECA, a specification-to-checklist audit framework that scales security reviews across implementations and reveals both its strengths in cross-implementation reuse and its current limits on complex stateful bugs.
Abstract: Multi-implementation systems (e.g., protocols with independent clients) are increasingly audited against natural-language specifications. Differential testing scales well when implementations disagree, but it provides little signal when all implementations converge on the same incorrect interpretation of an ambiguous requirement. We present SPECA, a \textbf{SPE}cification-to-\textbf{C}hecklist \textbf{A}uditing framework that turns normative requirements into property-based checklists, maps them to implementation locations, and supports cross-implementation reuse.
We instantiate SPECA in an in-the-wild security audit contest for the Ethereum Fusaka upgrade, covering 11 production clients. Across 54 submissions, 17 were judged valid by the contest organizers. Cross-implementation checks account for 76.5\% (13/17) of valid findings, suggesting that checklist-derived 1$\rightarrow$N reuse is a practical scaling mechanism in multi-implementation audits. To understand false positives, we manually coded the 37 invalid submissions and find that threat-model misalignment explains 56.8\% (21/37): these reports rely on assumptions about trust boundaries or scope that contradict the audit's rules. We also characterize key gaps: we detected no High/Medium findings in the V1 deployment, with misses concentrated in specification details and implicit assumptions (57.1\%), timing and concurrency issues (28.6\%), and external library dependencies (14.3\%). Our improved agent, evaluated against the ground truth of a competitive audit, achieves a strict recall of 27.3\% on high-impact vulnerabilities, placing it in the top 4\% of human auditors and outperforming 49 of 51 contestants in detecting critical issues. While this evidence comes from a single deployment, it highlights a general systems-security lesson: formalizing threat assumptions early is critical for reducing false positives and focusing agentic auditing effort. The approach also yields substantial efficiency gains: the agent-driven process distills each audit into a concise set of candidate findings, allowing human experts to validate them and finalize a submission in 40 minutes on average.
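To make the checklist mechanism described in the abstract concrete, below is a minimal, hypothetical Python sketch of how a normative requirement might become a property-based checklist item that is mapped to code locations in multiple clients and fanned out 1$\rightarrow$N. All names, fields, and file paths are illustrative assumptions for exposition; they are not the paper's actual data model or interface.

```python
# Hypothetical sketch of SPECA-style checklist reuse (illustrative only;
# not the paper's actual data model or API).
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    """A property-based check derived from one normative spec requirement."""
    spec_ref: str       # e.g., a spec section and its MUST clause (assumed format)
    property_text: str  # the normative property to verify
    # Map: client name -> candidate code locations implementing the requirement
    locations: dict[str, list[str]] = field(default_factory=dict)


def reuse_across_clients(item: ChecklistItem, clients: list[str]) -> list[tuple[str, str]]:
    """1->N reuse: once a check exists, emit one audit task per client.

    Each task pairs a client with the checklist property; an agent (or human)
    then verifies the property at that client's mapped locations.
    """
    tasks = []
    for client in clients:
        for loc in item.locations.get(client, ["<location unresolved>"]):
            tasks.append((client, f"Verify '{item.property_text}' at {loc}"))
    return tasks


# Usage: a single spec-derived check fans out to several clients.
item = ChecklistItem(
    spec_ref="hypothetical spec clause",
    property_text="blob count per block MUST NOT exceed the configured maximum",
    locations={"geth": ["core/txpool/validation.go"]},  # illustrative path
)
print(reuse_across_clients(item, ["geth", "nethermind", "besu"]))
```

The design point this sketch illustrates is that the checklist item, not the client, is the unit of audit effort: authoring the property once amortizes its cost across every implementation it is mapped to.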
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 230