Latent Undertow: How Ordinary Typos Break Probes

Published: 23 May 2026, Last Modified: 03 Jun 2026 | ICML 2026 AIWILD | CC BY 4.0
Keywords: interpretability, activation probes, agent safety
TL;DR: Activation-reading probes break under ordinary typos that LLMs themselves handle fluently; we characterize the geometric mechanism, quantify probe-level fragility, and propose KV-cache and architectural defenses
Abstract: LLM agents increasingly rely on activation-based probes as runtime monitors for prompt injection, unsafe intent, and policy violations. Reliability under the noisy text users actually send is a load-bearing assumption for safe deployment, and we show it is fragile: LLMs handle ordinary typing variation fluently---a typo or missing punctuation leaves both user intent and the model's response substantively unchanged---yet at the activations the probe reads, the same edit rotates the readout direction by $43^\circ$--$56^\circ$ at the perturbed token, decaying below $15\%$ within ${\approx}10$ downstream tokens. Stacking ${\approx}3$ common typos per message cuts a single-position prompt-injection probe's TPR@FPR$=$1\% by $12.0$pp, a gap recalibration alone cannot close---a non-adversarial failure mode for agent-side monitoring. Multi-position aggregation cures localized perturbations ($\leq 0.5$pp loss) but only attenuates distributed ones, where even attention- and max-based aggregators still drop ${\sim}3.8$pp. For single-position probes, we introduce a KV-cache fork: a short fixed post-user suffix that lets the probe read past the perturbation, exploiting the rapid spatial decay. This closes $95\%$ of the gap ($-0.6$pp residual)---an order of magnitude better than perturbation-augmented training ($-3.7$pp), and deployable as a drop-in modification of an agent's existing probe pipeline. Effects replicate on Llama-3.1-8B, Qwen3-8B, and Gemma-4-E4B; probe evaluation is on Llama-3.1-8B. Code: https://github.com/eladd-ai/latent-undertow.
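For readers unfamiliar with the quantities quoted in the abstract, the following is a minimal sketch of how they are conventionally computed: TPR at a fixed FPR, the angle between two direction vectors, and the fraction of a gap that a defense closes. It is not taken from the authors' repository; the function names `tpr_at_fpr`, `angle_degrees`, and `gap_closed` are hypothetical, and the thresholding convention (flag scores strictly above the threshold) is an assumption. Drops are passed as positive magnitudes in percentage points.

```python
import math


def tpr_at_fpr(benign_scores, attack_scores, target_fpr=0.01):
    """True-positive rate at a threshold whose benign FPR is at most target_fpr."""
    ranked = sorted(benign_scores, reverse=True)
    # Number of benign examples we are allowed to flag.
    allowed = math.floor(target_fpr * len(ranked))
    # Flag scores strictly above the (allowed + 1)-th largest benign score.
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    flagged = sum(1 for s in attack_scores if s > threshold)
    return flagged / len(attack_scores)


def angle_degrees(u, v):
    """Angle in degrees between two non-zero vectors (e.g. readout directions)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def gap_closed(baseline_drop_pp, residual_drop_pp):
    """Fraction of a performance drop that a defense recovers."""
    return 1.0 - residual_drop_pp / baseline_drop_pp
```

With the abstract's numbers, `gap_closed(12.0, 0.6)` evaluates to roughly 0.95, which matches the stated 95% of the gap.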
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 101