Latent Undertow: How Ordinary Typos Break Probes
Keywords: interpretability, activation probes, agent safety
TL;DR: Activation-reading probes break under ordinary typos that LLMs themselves handle fluently; we characterize the geometric mechanism, quantify probe-level fragility, and propose KV-cache and architectural defenses
Abstract: LLM agents increasingly rely on activation-based probes as runtime
monitors for prompt injection, unsafe intent, and policy violations.
Reliability under the noisy text users actually send is a load-bearing
assumption for safe deployment, and we show it is fragile: LLMs handle
ordinary typing variation fluently---a typo or missing punctuation
leaves both user intent and the model's response substantively
unchanged---yet at the activations the probe reads, the same edit
rotates the readout direction by $43^\circ$--$56^\circ$ at the
perturbed token, decaying below $15\%$ within ${\approx}10$ downstream
tokens. Stacking ${\approx}3$ common typos per message cuts a
single-position prompt-injection probe's TPR@FPR$=$1\% by $12.0$pp,
a gap recalibration alone cannot close---a non-adversarial failure
mode for agent-side monitoring. Multi-position aggregation cures
localized perturbations ($\leq 0.5$pp loss) but only attenuates
distributed ones, where even attention- and max-based aggregators
still drop ${\sim}3.8$pp. For single-position probes, we introduce
a KV-cache fork: a short fixed post-user suffix that lets the probe
read past the perturbation, exploiting the rapid spatial decay. This
closes $95\%$ of the gap ($-0.6$pp residual)---an order of magnitude
better than perturbation-augmented training ($-3.7$pp), and deployable
as a drop-in modification of an agent's existing probe pipeline.
Effects replicate on Llama-3.1-8B, Qwen3-8B, and Gemma-4-E4B; probe
evaluation uses Llama-3.1-8B.
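The three quantities the abstract reports (a rotation angle between readout directions, TPR at a fixed 1% FPR, and the fraction of a gap that a defense closes) can be made concrete with a minimal sketch. This is an illustration under stated assumptions, not the paper's evaluation code: the function names `angle_degrees`, `tpr_at_fpr`, and `gap_closed` are hypothetical, and the thresholding convention (flag a message when its score is strictly above the threshold) is one common choice that may differ from the authors'.

```python
import math


def angle_degrees(u, v):
    """Angle in degrees between two direction vectors (e.g. clean vs. perturbed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def tpr_at_fpr(neg_scores, pos_scores, target_fpr=0.01):
    """True-positive rate at the lowest threshold whose FPR is <= target_fpr.

    Assumption: a message is flagged when its probe score is strictly
    greater than the threshold.
    """
    ranked = sorted(neg_scores, reverse=True)
    # Number of negatives that may be flagged without exceeding the target FPR.
    allowed = int(math.floor(target_fpr * len(ranked)))
    threshold = ranked[allowed] if allowed < len(ranked) else -math.inf
    flagged = sum(1 for s in pos_scores if s > threshold)
    return flagged / len(pos_scores)


def gap_closed(baseline_drop_pp, residual_drop_pp):
    """Fraction of a TPR drop (in percentage points) that a defense recovers."""
    return 1.0 - residual_drop_pp / baseline_drop_pp


# Figures taken from the abstract: a 12.0pp drop reduced to a 0.6pp residual.
print(f"gap closed: {gap_closed(12.0, 0.6):.0%}")
```

Applied to the abstract's numbers, `gap_closed(12.0, 0.6)` reproduces the stated 95% closure. The same threshold must be fixed on clean negatives before scoring perturbed positives if the goal is to measure the drop without recalibration.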
Code: https://github.com/eladd-ai/latent-undertow.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 101