Keywords: Persistent Homology, Interpretability, Topological Data Analysis, Representation Geometry, Large Language Models, AI Security, Adversarial Attacks, Sparse Autoencoders
TL;DR: We use persistent homology to interpret how adversarial inputs reshape LLM representation spaces, uncovering a robust topological signature that offers multiscale, geometry-aware insights complementary to standard interpretability methods.
Abstract: Existing interpretability methods for Large Language Models (LLMs) often fall short by focusing on linear directions or isolated features, overlooking the high-dimensional, nonlinear, and relational geometry within model representations. This study focuses on how adversarial inputs systematically affect the internal representation spaces of LLMs, a topic that remains poorly understood. We propose applying persistent homology (PH) to measure and understand the geometry and topology of the representation space when the model is under external adversarial influence. Specifically, we use PH to systematically interpret six state-of-the-art models under two distinct adversarial conditions—indirect prompt injection and backdoor fine-tuning—and uncover a consistent topological signature of adversarial influence. Across architectures and model sizes, adversarial inputs induce "topological compression", where the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, and more dispersed large-scale ones. This topological signature is statistically robust across layers, highly discriminative, and provides interpretable insights into how adversarial effects emerge and propagate. By quantifying the shape of activations and neuron-level information flow, our architecture-agnostic framework reveals fundamental invariants of representational change, offering a complementary perspective to existing interpretability methods.
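To make the abstract's pipeline concrete, the sketch below shows one way persistent homology could be computed on hidden-state point clouds and summarized so that "topological compression" (fewer, longer-lived features) becomes measurable. This is an illustrative assumption, not the authors' code: the function name `summarize_topology` and the inputs `clean_hidden_states` / `adversarial_hidden_states` are hypothetical, and the `ripser` package stands in for whatever PH backend the paper actually uses.

```python
# Illustrative sketch (not the paper's implementation): persistence diagrams
# for hidden-state point clouds, with simple per-dimension summaries.
import numpy as np
from ripser import ripser  # assumed PH backend; any Vietoris-Rips library works


def summarize_topology(activations: np.ndarray, maxdim: int = 1) -> dict:
    """Compute H0/H1 persistence diagrams for a (n_points, d_model) array
    and return feature counts and lifetime statistics per homology dimension."""
    dgms = ripser(activations, maxdim=maxdim)["dgms"]
    summaries = {}
    for dim, dgm in enumerate(dgms):
        finite = dgm[np.isfinite(dgm[:, 1])]      # drop the infinite H0 bar
        lifetimes = finite[:, 1] - finite[:, 0]   # persistence of each feature
        summaries[f"H{dim}"] = {
            "n_features": int(len(finite)),
            "mean_lifetime": float(lifetimes.mean()) if len(finite) else 0.0,
            "max_lifetime": float(lifetimes.max()) if len(finite) else 0.0,
        }
    return summaries


# Hypothetical usage: under the "topological compression" signature one would
# expect the adversarial summary to show fewer features with larger lifetimes
# than the clean summary at the same layer.
# clean_stats = summarize_topology(clean_hidden_states)
# adv_stats = summarize_topology(adversarial_hidden_states)
```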
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 17657