TL;DR: Persistent homology captures distinct topological signatures of LLM representations under adversarial influence across different model architectures and sizes.
Abstract: Internal model representations are central to driving interpretability in machine learning and understanding them is key to reliability. Using persistent homology (PH), a technique from topological data analysis that captures the shape and structure of data at multiple scales, we present a global and local characterization of the latent space of three state-of-the-art Large Language Models (LLMs) under two adversarial conditions. Through a layer-wise topological analysis, we show that adversarial interventions consistently compress the latent space, reducing topological diversity at smaller scales while amplifying prominent structures at higher scales. Critically, these topological signatures are statistically meaningful and remain consistent across model architectures and sizes. We further introduce a novel neuron-level interpretability framework where PH is used to quantify information flow within and across layers. Our results establish PH as a powerful tool for interpretability in LLMs and for detecting distinct operational modes under adversarial influence.
Primary Area: Deep Learning->Large Language Models
Keywords: Topological Data Analysis, Persistent Homology, Large Language Models, Interpretability, Representations, Prompt Injection, Backdoors
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Submission Number: 7108
Loading