Keywords: AI governance, agentic LLMs, input-side policy, safety guardrails, auditability, healthcare, pre-execution controls
TL;DR: iCRAFT enforces input-side policy for agentic LLMs, blocking harmful prompts before generation while preserving accuracy and auditability.
Abstract: AI assistants that plan and call tools create new governance needs. We present iCRAFT, a software architecture framework that enforces policy at request ingress, before any model generation or tool use, and records auditable evidence. We implement the input-side subset: minimal protected health information (PHI) scrubbing, a small set of documented patterns for clearly disallowed requests, a whitelist for obviously benign intents, a lightweight ALLOW/REFUSE safety classifier, and an approval rule for high-risk actions. All decisions (trigger, outcome, latency) are logged to a versioned knowledge repository. Using three model tiers, we evaluate on standardized slices of MedMCQA and MedQA (utility) and JailbreakBench (adversarial) in classification-only mode. Enabling the gate leaves medical QA accuracy unchanged (no significant difference), while blocking 90-94\% of clearly harmful prompts before generation, with 3-7\% residual risk. At a policy-strict setting, benign blocks are 20-30\% and can be reduced by adjusting whitelist scope and classifier calibration. Latency overhead is negligible for the rule and whitelist paths and 0.63-0.80s only when the classifier runs. These results show that early, input-side policy enforcement can reduce exposure to unsafe behavior, work across models and vendors, and produce audit-ready artifacts that support governance by design.
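The abstract describes a fixed decision order at ingress: PHI scrubbing, then documented disallow patterns, then a benign whitelist, then an approval rule for high-risk actions, with a lightweight classifier as the fallback, and every decision logged with its trigger, outcome, and latency. The sketch below illustrates that ordering only; all identifiers (the pattern lists, `gate`, `classify`, `Decision`) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an input-side gate, assuming hypothetical pattern lists and
# component names; not the iCRAFT implementation. Decisions are made before any
# model generation or tool use and carry (trigger, outcome, latency) for logging.
import re
import time
from dataclasses import dataclass

PHI_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]           # e.g., SSN-like identifiers (assumed)
BLOCK_PATTERNS = [r"\bsynthesize\b.*\btoxin\b"]     # documented disallowed patterns (assumed)
WHITELIST = [r"\bwhat is the recommended dose\b"]   # obviously benign intents (assumed)
HIGH_RISK = [r"\bprescribe\b", r"\bdelete records\b"]  # actions routed to human approval (assumed)

@dataclass
class Decision:
    outcome: str      # ALLOW / REFUSE / NEEDS_APPROVAL
    trigger: str      # which stage decided: rule, whitelist, approval_rule, classifier
    latency_s: float

def classify(prompt: str) -> str:
    """Stand-in for the lightweight ALLOW/REFUSE safety classifier."""
    return "ALLOW"    # placeholder; a real system would invoke a small model here

def gate(prompt: str) -> Decision:
    start = time.perf_counter()
    scrubbed = prompt
    for p in PHI_PATTERNS:                           # minimal PHI scrubbing
        scrubbed = re.sub(p, "[PHI]", scrubbed)
    if any(re.search(p, scrubbed, re.I) for p in BLOCK_PATTERNS):
        return Decision("REFUSE", "rule", time.perf_counter() - start)
    if any(re.search(p, scrubbed, re.I) for p in WHITELIST):
        return Decision("ALLOW", "whitelist", time.perf_counter() - start)
    if any(re.search(p, scrubbed, re.I) for p in HIGH_RISK):
        return Decision("NEEDS_APPROVAL", "approval_rule", time.perf_counter() - start)
    return Decision(classify(scrubbed), "classifier", time.perf_counter() - start)

if __name__ == "__main__":
    d = gate("What is the recommended dose of acetaminophen?")
    print(d)  # in practice, each Decision would be appended to the versioned audit log
```

Under this ordering, the cheap rule and whitelist checks resolve most requests with negligible latency, which is consistent with the abstract's claim that classifier latency is incurred only when classification actually runs.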
Submission Number: 39