Keywords: intrinsic dimension, internal representations, LLMs, jailbreaks
TL;DR: We show that the intrinsic dimension of prompt-level token representations peaks in early-to-middle layers, increases under token shuffling, and correlates with surprisal. A simple linear probe on the per-layer intrinsic dimension profile flags malicious vs. benign prompts with high accuracy.
Abstract: We study the geometry of token representations at the prompt level in large language models through the lens of intrinsic dimension. Viewing transformers as mean-field particle systems, we estimate the intrinsic dimension of the empirical measure at each layer and demonstrate that it correlates with next-token uncertainty. Across models and intrinsic dimension estimators, we find that intrinsic dimension peaks in early-to-middle layers and increases under semantic disruption (via token shuffling), and that it is strongly correlated with average surprisal, with a simple analysis linking logit geometry to entropy through the softmax. As a case study in practical interpretability and safety, we train a linear probe on the per-layer intrinsic dimension profile to distinguish malicious from benign prompts before generation. This probe achieves 90–95% accuracy across different datasets, outperforming widely used guardrails such as Llama Guard and Gemma Shield. We further compare against linear probes built from layerwise entropy derived via the Tuned Lens and find that the intrinsic-dimension-based probe is competitive and complementary, offering a compact, interpretable signal distributed across layers. Our findings suggest that prompt-level geometry provides actionable signals for monitoring and controlling LLM behavior, and offers a bridge between mechanistic insights and practical safety tools.
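To make the pipeline concrete, below is a minimal sketch (not the authors' code) of the two steps the abstract describes: estimating a per-layer intrinsic dimension profile of a prompt's token representations and fitting a linear probe on those profiles. The choice of the TwoNN estimator, the model name, and the `benign`/`malicious` prompt lists are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: per-layer intrinsic-dimension profile of a prompt + linear probe.
# TwoNN estimator (Facco et al., 2017), model name, and data variables are assumptions.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModelForCausalLM, AutoTokenizer

def two_nn_id(X: np.ndarray) -> float:
    """MLE TwoNN intrinsic-dimension estimate for a point cloud X of shape (n_tokens, d_model)."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]            # ratio of 2nd- to 1st-neighbor distances
    mu = mu[np.isfinite(mu) & (mu > 1.0)]   # drop degenerate points (e.g. duplicate tokens)
    return len(mu) / np.sum(np.log(mu))

def id_profile(prompt: str, model, tokenizer) -> np.ndarray:
    """Intrinsic dimension of the prompt's token representations at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return np.array([two_nn_id(h[0].float().cpu().numpy()) for h in out.hidden_states])

# Assumed usage: `benign` and `malicious` are lists of prompt strings (placeholders).
model_name = "meta-llama/Llama-3.1-8B"  # illustrative model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")

X = np.stack([id_profile(p, model, tokenizer) for p in benign + malicious])
y = np.array([0] * len(benign) + [1] * len(malicious))
probe = LogisticRegression(max_iter=1000).fit(X, y)  # linear probe on per-layer ID profiles
```

The probe sees only the layerwise intrinsic dimension vector, so its weights directly indicate which layers carry the malicious-vs-benign signal; whether logistic regression matches the paper's exact probe formulation is an assumption of this sketch.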
Primary Area: interpretability and explainable AI
Submission Number: 17592