Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Xisen Jin; Michael Duan; Qin Lin; Aaron Chan; Zhenglun Chen; Junyi Du; Xiang Ren

Published: 03 Jun 2026, Last Modified: 14 Jun 2026

Venue: AI4GOOD Workshop 2026 (Regular)

License: CC BY 4.0

Keywords: AI Agent, Guardrails, Trusted Execution Environments, Remote Attestation

TL;DR: Proof-of-guardrail enables a remote AI agent to return a cryptographic proof that it actually ran a specific guardrail for a response.

Abstract: As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces the threat that safety measures are falsely advertised. To address this threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response was generated after running a specific open-source guardrail. To generate the proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution that any user can verify offline. We implement proof-of-guardrail for OpenClaw agents and evaluate its latency overhead and deployment cost. Proof-of-guardrail ensures the integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://anonymous.4open.science/r/Proof-of-Guardrail-Anonymous-680D.
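
The user-side check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the attestation layout, the field names (`measurement`, `response_hash`), and the functions `toy_attest` and `verify_proof` are all hypothetical. An HMAC with a shared toy key stands in for the TEE signature only so the example runs with the Python standard library; a real TEE would sign with an asymmetric hardware-backed key whose certificate chains to the vendor, which is what makes offline verification by any user possible.

```python
import hashlib
import hmac
import json

# ASSUMPTION: toy stand-in for the TEE signing key. Real TEEs use an
# asymmetric key certified by the hardware vendor, not a shared secret.
TOY_TEE_KEY = b"toy-tee-key"


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest, used for code measurement and response binding."""
    return hashlib.sha256(data).hexdigest()


def toy_attest(guardrail_code: bytes, response: str) -> dict:
    """Simulate the enclave side: sign a (code measurement, response hash) pair."""
    body = {
        "measurement": sha256_hex(guardrail_code),
        "response_hash": sha256_hex(response.encode()),
    }
    payload = json.dumps(body, sort_keys=True).encode()
    signature = hmac.new(TOY_TEE_KEY, payload, hashlib.sha256).hexdigest()
    return {"body": body, "signature": signature}


def verify_proof(attestation: dict, response: str, expected_measurement: str) -> bool:
    """User side: check the signature, the guardrail measurement, and the response binding."""
    body = attestation["body"]
    payload = json.dumps(body, sort_keys=True).encode()
    expected_sig = hmac.new(TOY_TEE_KEY, payload, hashlib.sha256).hexdigest()
    # 1. The attestation must carry a valid (here: toy) TEE signature.
    if not hmac.compare_digest(expected_sig, attestation["signature"]):
        return False
    # 2. The attested code must match the open-source guardrail the user expects.
    if body["measurement"] != expected_measurement:
        return False
    # 3. The attestation must be bound to this exact response.
    return body["response_hash"] == sha256_hex(response.encode())


if __name__ == "__main__":
    guardrail_code = b"def guardrail(text): return 'unsafe' not in text"
    proof = toy_attest(guardrail_code, "Hello, user.")
    print(verify_proof(proof, "Hello, user.", sha256_hex(guardrail_code)))
    print(verify_proof(proof, "Tampered reply.", sha256_hex(guardrail_code)))
```

Even in this toy form, a passing check shows only that the expected guardrail code ran for this response. It does not show that the response is safe, which is the deception risk the abstract raises.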

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 139
