Caught in the Act(ivation): Stopping Credential Exfiltration Before It Starts

Published: 23 May 2026, Last Modified: 23 May 2026, ICML 2026 AIWILD, CC BY 4.0
Keywords: prompt injection, credential exfiltration, LLM agents, agentic security, honeytoken, differential privacy, conformal prediction, mechanistic interpretability, mutual information, multi-turn attacks
TL;DR: We present the Agentic Immune System (AIS), the first defense against LLM credential exfiltration that is simultaneously pre-output, formally grounded, and temporally aware, achieving 0.94 detection at 0.005 FPR across 2,439 benchmark attempts.
Abstract: LLM agents leak credentials when untrusted inputs manipulate them via indirect prompt injection. Existing defenses share three structural gaps: text-level output monitoring is evaded by Base64 encoding or per-turn character dripping; honeytoken schemes provide no formal indistinguishability guarantee; and every deployed defense evaluates one turn in isolation, leaving covert multi-turn exfiltration invisible. We present the Agentic Immune System (AIS), a unified framework that closes all three gaps through strictly complementary components. CIFT (Causal Information Flow Tracking) monitors transformer activations at credential positions before any token is emitted, achieving encoding-invariant detection at 0.998 AUROC with under 1 ms overhead. ε-DECEPTION generates honeytokens via a Laplace-noised bigram mechanism with a formal ε-indistinguishability bound, wrapped in a conformal calibration layer that eliminates threshold tuning entirely. NIMBUS (Neural Information Monitoring Budget) enforces a cumulative mutual-information bit-budget across turns via an InfoNCE critic, detecting 90% of 50 simulated 20-turn covert conversations at zero false-block rate. The full AIS achieves 0.94 detection at 0.005 FPR across 2,439 benchmark attempts, Pareto-dominating every baseline. For closed-weight deployments where activation access is unavailable, an API-deployable configuration (ε-DECEPTION + NIMBUS) achieves 0.71 detection at 0.008 FPR. AIS is, to our knowledge, the first defense simultaneously pre-output, formally grounded, and temporally aware.
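The cumulative bit-budget idea behind NIMBUS can be illustrated with a minimal sketch. It assumes a per-turn leakage estimate in bits is already available (in the paper this comes from an InfoNCE critic; here the numbers are made up), and the names `BitBudgetMonitor`, `observe`, and `first_blocked_turn` are hypothetical, not taken from the authors' code. The point is only that a slow drip which stays small on every individual turn still exhausts a conversation-level budget.

```python
from dataclasses import dataclass, field


@dataclass
class BitBudgetMonitor:
    """Hypothetical cumulative leakage budget; not the paper's NIMBUS implementation."""

    budget_bits: float
    spent_bits: float = 0.0
    history: list = field(default_factory=list)

    def observe(self, turn_bits: float) -> bool:
        """Record one turn's estimated leakage; return True if the turn is allowed."""
        # Mutual information is non-negative, but estimators can dip below zero,
        # so negative estimates are clipped (an assumption of this sketch).
        turn_bits = max(0.0, turn_bits)
        if self.spent_bits + turn_bits > self.budget_bits:
            # Blocked turns do not consume budget.
            self.history.append((turn_bits, False))
            return False
        self.spent_bits += turn_bits
        self.history.append((turn_bits, True))
        return True


def first_blocked_turn(per_turn_bits, budget_bits):
    """Return the 1-indexed turn at which the budget is first exceeded, or None."""
    monitor = BitBudgetMonitor(budget_bits)
    for turn, bits in enumerate(per_turn_bits, start=1):
        if not monitor.observe(bits):
            return turn
    return None


if __name__ == "__main__":
    # Made-up estimates: a 20-turn conversation leaking 2 bits per turn.
    drip = [2.0] * 20
    print(first_blocked_turn(drip, budget_bits=16.0))
    # Made-up benign conversation with negligible per-turn leakage.
    benign = [0.1] * 20
    print(first_blocked_turn(benign, budget_bits=16.0))
```

With these illustrative numbers, a per-turn check with any threshold above 2 bits would pass every turn of the drip conversation, whereas the cumulative budget stops it partway through and leaves the benign conversation untouched.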
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 93