Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Interpretability for AI Safety
TL;DR: Activation-reading probes break under ordinary typos that LLMs themselves handle fluently; we characterize the geometric mechanism, quantify probe-level fragility, and propose KV-cache and architectural defenses
Abstract: LLMs handle ordinary typing variation fluently: a typo or missing
punctuation leaves both user intent and the model's response
substantively unchanged. Yet probes that detect malicious prompts
by reading the model's hidden states tell a different story: the
same edit rotates the readout vector by $43^\circ$--$56^\circ$ at
the perturbed token, decaying below $15\%$ within ${\approx}10$
downstream tokens. Stacking ${\approx}3$ common typos per message
cuts a single-position prompt-injection probe's TPR@FPR$=$1\% by
$12.0$pp, a gap recalibration alone cannot close. Multi-position
aggregation cures localized perturbations ($\leq 0.5$pp loss) but
only attenuates distributed ones, where even attention- and
max-based aggregators still drop ${\sim}3.8$pp. For single-position
probes, we introduce a KV-cache fork: a short fixed suffix appended
after the user message lets the probe read a few tokens downstream
of the perturbation, exploiting its rapid spatial decay. This
closes $95\%$ of the gap ($-0.6$pp residual), leaving a residual roughly
six times smaller than perturbation-augmented training does ($-3.7$pp).
The rotation-and-decay geometry replicates on Llama-3.1-8B,
Qwen3-8B, and Gemma-4-E4B; probe evaluation is on Llama-3.1-8B.
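
The quantities quoted above can be made concrete with a minimal sketch. It is not taken from the linked repository: the function names `rotation_angle_deg`, `tpr_at_fpr`, and `gap_closed` are hypothetical, the thresholding rule is one common convention for TPR at a fixed FPR, and reading the "gap" as the $12.0$pp drop of the unprotected single-position probe is an assumption.

```python
import math

def rotation_angle_deg(u, v):
    """Angle in degrees between two equal-length vectors (e.g. clean vs. perturbed readout)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos))

def tpr_at_fpr(pos_scores, neg_scores, target_fpr=0.01):
    """Fraction of positives flagged at a threshold that flags at most target_fpr of negatives.

    Assumed convention: a score is flagged when it is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of false positives the budget allows (small epsilon guards float rounding).
    k = int(target_fpr * len(neg_sorted) + 1e-9)
    # Threshold at the (k+1)-th largest negative: at most k negatives lie strictly above it.
    threshold = neg_sorted[k] if k < len(neg_sorted) else float("-inf")
    return sum(s > threshold for s in pos_scores) / len(pos_scores)

def gap_closed(baseline_drop_pp, residual_drop_pp):
    """Share of a baseline TPR drop that a defense recovers (assumed definition)."""
    return 1.0 - residual_drop_pp / baseline_drop_pp
```

Under these assumptions, `gap_closed(12.0, 0.6)` evaluates to about `0.95`, consistent with the $95\%$ figure quoted for the KV-cache fork, and `rotation_angle_deg` applied to a clean and a perturbed readout vector would give the kind of angle reported as $43^\circ$--$56^\circ$.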
Code: https://github.com/eladd-ai/latent-undertow
Submission Number: 129