Verifiability-First Agents: Provable Observability and Lightweight Audit Agents for Controlling Autonomous LLM Systems
Keywords: Agentic AI, Verifiable Agents, Autonomous LLM Agents, AI Accountability
TL;DR: We introduce a Verifiability-First Architecture that makes LLM agents provably accountable using cryptographic action attestations and audit ensembles, cutting detection time by 25% and boosting attribution confidence to 0.85.
Abstract: As LLM-based agents grow more autonomous and multi-modal, ensuring they remain controllable, auditable, and faithful to deployer intent becomes critical. Prior benchmarks measured propensity for misaligned behavior and showed that agent personalities and tool access significantly influence misalignment. Building on those insights, we propose a Verifiability-First architecture that (1) integrates run-time attestations of agent actions (cryptographic & symbolic), (2) embeds lightweight Audit Agents that continuously verify intent vs. behavior using constrained reasoning, and (3) enforces challenge–response attestation protocols for high-risk operations. We introduce OPERA (Observability, Provable Execution, Red-team, Attestation), a benchmark suite and evaluation protocol designed to measure (i) detectability of misalignment, (ii) time-to-detection under stealthy strategies, and (iii) resilience of verifiability mechanisms to adversarial prompt/persona injection. Our approach aims to shift the evaluation focus from "how likely misalignment is" to "how quickly and reliably misalignment can be detected and remediated."
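The abstract's first component, run-time cryptographic attestation of agent actions, can be illustrated with a minimal sketch. The example below is not the paper's implementation; it assumes a simple HMAC-based scheme in which each tool invocation is serialized canonically and tagged with a keyed digest, so an auditor holding the key can later verify that a logged action record was neither forged nor tampered with. Function names (`attest_action`, `verify_action`) and the record layout are hypothetical.

```python
import hashlib
import hmac
import json


def attest_action(key: bytes, action: dict, ts: int) -> dict:
    """Produce an attested record of an agent action.

    The action and timestamp are serialized canonically (sorted keys)
    and tagged with an HMAC-SHA256 digest under the attestation key.
    """
    record = {"action": action, "ts": ts}
    payload = json.dumps(record, sort_keys=True).encode()
    record["mac"] = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return record


def verify_action(key: bytes, record: dict) -> bool:
    """Check that a logged record carries a valid attestation tag."""
    body = {k: v for k, v in record.items() if k != "mac"}
    payload = json.dumps(body, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(record.get("mac", ""), expected)


if __name__ == "__main__":
    key = b"attestation-key"  # hypothetical deployer-held secret
    rec = attest_action(key, {"tool": "shell", "cmd": "ls"}, ts=1700000000)
    print(verify_action(key, rec))          # genuine record verifies
    rec["action"]["cmd"] = "rm -rf /"       # simulated tampering
    print(verify_action(key, rec))          # tampered record fails
```

A symmetric-key sketch like this assumes the auditor and deployer share the attestation key; the paper's challenge–response protocols for high-risk operations would additionally require the agent to respond to fresh auditor-issued nonces rather than replaying old attestations.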
Submission Number: 69