Holes in Latent Space: Topological Signatures Under Adversarial Influence

Aideen Fay; Inés García-Redondo; Qiquan Wang; Haim Dubossarsky; Anthea Monod

Holes in Latent Space: Topological Signatures Under Adversarial Influence

Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod

22 Jan 2025 (modified: 18 Jun 2025)Submitted to ICML 2025EveryoneRevisionsBibTeXCC BY 4.0

TL;DR: Persistent homology captures distinct topological signatures of LLM representations under adversarial influence across different model architectures and sizes.

Abstract: Internal model representations are central to driving interpretability in machine learning and understanding them is key to reliability. Using persistent homology (PH), a technique from topological data analysis that captures the shape and structure of data at multiple scales, we present a global and local characterization of the latent space of three state-of-the-art Large Language Models (LLMs) under two adversarial conditions. Through a layer-wise topological analysis, we show that adversarial interventions consistently compress the latent space, reducing topological diversity at smaller scales while amplifying prominent structures at higher scales. Critically, these topological signatures are statistically meaningful and remain consistent across model architectures and sizes. We further introduce a novel neuron-level interpretability framework where PH is used to quantify information flow within and across layers. Our results establish PH as a powerful tool for interpretability in LLMs and for detecting distinct operational modes under adversarial influence.

Primary Area: Deep Learning->Large Language Models

Keywords: Topological Data Analysis, Persistent Homology, Large Language Models, Interpretability, Representations, Prompt Injection, Backdoors

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Submission Number: 7108

Loading