Latent Undertow: How Ordinary Typos Break Probes

Published: 11 Jun 2026, Last Modified: 18 Jun 2026, Venue: Mech Interp Workshop ICML 2026 Poster, Readers: Everyone, License: CC BY 4.0
Keywords: Methods (probing, steering, causal interventions), Feature Geometry, Interpretability for AI Safety
TL;DR: Activation-reading probes break under ordinary typos that LLMs themselves handle fluently; we characterize the geometric mechanism, quantify probe-level fragility, and propose KV-cache and architectural defenses
Abstract: LLMs handle ordinary typing variation fluently: a typo or missing punctuation leaves both user intent and the model's response substantively unchanged. Yet probes that detect malicious prompts by reading the model's hidden states tell a different story: the same edit rotates the readout vector by $43^\circ$--$56^\circ$ at the perturbed token, decaying below $15\%$ within ${\approx}10$ downstream tokens. Stacking ${\approx}3$ common typos per message cuts a single-position prompt-injection probe's TPR@FPR$=$1\% by $12.0$pp, a gap recalibration alone cannot close. Multi-position aggregation cures localized perturbations ($\leq 0.5$pp loss) but only attenuates distributed ones, where even attention- and max-based aggregators still drop ${\sim}3.8$pp. For single-position probes, we introduce a KV-cache fork: a short fixed suffix appended after the user message lets the probe read a few tokens downstream of the perturbation, exploiting its rapid spatial decay. This closes $95\%$ of the gap ($-0.6$pp residual)---an order of magnitude better than perturbation-augmented training ($-3.7$pp). The rotation-and-decay geometry replicates on Llama-3.1-8B, Qwen3-8B, and Gemma-4-E4B; probe evaluation is on Llama-3.1-8B. Code: https://github.com/eladd-ai/latent-undertow
Submission Number: 129
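
The two quantities the abstract reports, a rotation angle in degrees and TPR at a fixed 1% FPR, can be illustrated with a minimal NumPy sketch. This is not the authors' code: the function names are hypothetical, the rotation is assumed to be the angle between a clean and a perturbed activation vector at the same position, and the threshold is assumed to be set as a quantile of benign probe scores.

```python
import numpy as np

def rotation_degrees(clean, perturbed):
    """Angle in degrees between two activation vectors (assumed definition)."""
    clean = np.asarray(clean, dtype=float)
    perturbed = np.asarray(perturbed, dtype=float)
    cos = np.dot(clean, perturbed) / (
        np.linalg.norm(clean) * np.linalg.norm(perturbed)
    )
    # Clip to guard against floating-point values slightly outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def tpr_at_fpr(benign_scores, malicious_scores, target_fpr=0.01):
    """True-positive rate at the threshold that yields target_fpr on benign data."""
    benign_scores = np.asarray(benign_scores, dtype=float)
    malicious_scores = np.asarray(malicious_scores, dtype=float)
    # Threshold is the (1 - target_fpr) quantile of benign scores, so roughly
    # target_fpr of benign examples score strictly above it.
    threshold = np.quantile(benign_scores, 1.0 - target_fpr)
    return float(np.mean(malicious_scores > threshold))
```

Under these assumptions, a drop in TPR@FPR$=$1\% would be the difference between `tpr_at_fpr` evaluated on probe scores for clean malicious prompts and for their typo-perturbed versions, with the threshold fixed from the same benign scores.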