Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks
Keywords: encoder language models, adversarial attacks, topological data analysis, persistent homology, attention mechanisms, TextFooler, statistical tests, layer analysis, model interpretability
TL;DR: Topological data analysis (TDA) can track how adversarial perturbations affect a language model's latent space by examining their effects layer by layer.
Abstract: Large language models are known to be vulnerable to adversarial perturbations such as synonym-based word substitutions. However, previous analyses of adversarial influence focus only on output behavior and provide limited insight into the propagation of substitution-based input perturbations through internal representations. In this work, we introduce a topological data analysis (TDA) framework to study the structural effects of adversarial attacks on attention maps across model layers.
We evaluate small encoder-based architectures (BERT, RoBERTa, DistilBERT) that were fine-tuned for binary classification on the IMDb review dataset and then attacked using TextFooler.
We convert attention maps into distance matrices and apply TDA to extract topological features, which we then compare using Wasserstein distances between original and perturbed features. In parallel, we compute a non-TDA baseline using per-head $L_1$ distances between the original and perturbed attention maps.
In addition, we analyze these models on a layer-by-layer basis.
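The comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the conversion `1 - max(a_ij, a_ji)` from attention weights to distances, the restriction to 0-dimensional persistent homology (whose finite death times equal the minimum-spanning-tree edge weights), and all function names are assumptions made for illustration. The attention maps are random placeholders, and the original and perturbed inputs are assumed to have the same token count.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.sparse.csgraph import minimum_spanning_tree


def attention_to_distance(attn):
    """Turn one (seq, seq) attention map into a symmetric distance matrix.

    Assumed scheme: d_ij = 1 - max(a_ij, a_ji), with a zero diagonal.
    """
    sym = np.maximum(attn, attn.T)
    # Keep off-diagonal entries strictly positive: SciPy treats 0 as "no edge".
    dist = np.clip(1.0 - sym, 1e-12, None)
    np.fill_diagonal(dist, 0.0)
    return dist


def h0_deaths(dist):
    """Finite H0 death times of the Vietoris-Rips filtration.

    Every component is born at 0 and dies at a minimum-spanning-tree edge weight.
    """
    return np.sort(minimum_spanning_tree(dist).data)


def wasserstein_h0(deaths_a, deaths_b):
    """1-Wasserstein distance between two H0 diagrams with points (0, d).

    Uses the L-infinity ground metric; an unmatched point pays d / 2,
    its distance to the diagonal.
    """
    a = np.asarray(deaths_a, dtype=float)
    b = np.asarray(deaths_b, dtype=float)
    m, n = len(a), len(b)
    cost = np.zeros((m + n, m + n))
    cost[:m, :n] = np.abs(a[:, None] - b[None, :])  # point-to-point
    cost[:m, n:] = a[:, None] / 2.0                 # point in A to diagonal
    cost[m:, :n] = b[None, :] / 2.0                 # point in B to diagonal
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())


def per_head_l1(attn_orig, attn_pert):
    """Non-TDA baseline: L1 distance per head for (heads, seq, seq) arrays."""
    return np.abs(attn_orig - attn_pert).sum(axis=(-2, -1))


def random_attention(rng, heads, seq_len):
    """Placeholder attention: row-wise softmax of random logits."""
    logits = rng.normal(size=(heads, seq_len, seq_len))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
orig = random_attention(rng, heads=4, seq_len=16)
pert = random_attention(rng, heads=4, seq_len=16)

# One topological distance and one L1 distance per head in a single layer.
tda_dists = np.array([
    wasserstein_h0(h0_deaths(attention_to_distance(o)),
                   h0_deaths(attention_to_distance(p)))
    for o, p in zip(orig, pert)
])
l1_dists = per_head_l1(orig, pert)
```

Repeating this for every layer yields one set of distances per layer, which is the quantity the layer-wise comparison operates on. A full pipeline would likely use a dedicated persistent homology library and may include higher-dimensional features.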
We find that adversarial perturbations induce systematic and statistically significant topological changes across layers, with the largest deviations occurring in late layers and smaller but notable effects in early layers. These patterns are consistent across models and are validated using both non-parametric (Kruskal--Wallis, Dunn) and parametric (one-way ANOVA, Tukey) tests on log-transformed Wasserstein distances. Compared to our non-TDA baseline, the TDA approach shows more distinct layer-wise separation and provides a robust and interpretable framework for evaluating how adversarial perturbations alter internal model structure.
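The layer-wise tests named above can be sketched with SciPy as follows. This is an illustration under stated assumptions, not the authors' analysis script: the function name `layer_tests` is hypothetical, the input values are synthetic placeholders rather than results from this work, and Dunn's post-hoc test is omitted because SciPy does not provide it.

```python
import numpy as np
from scipy import stats


def layer_tests(distances_by_layer):
    """Compare per-layer Wasserstein distances after a log transform.

    `distances_by_layer` is a list with one array of strictly positive
    distances per layer (the log is undefined at zero).
    """
    logged = [np.log(np.asarray(d, dtype=float)) for d in distances_by_layer]
    kruskal = stats.kruskal(*logged)   # non-parametric omnibus test
    anova = stats.f_oneway(*logged)    # parametric omnibus test
    tukey = stats.tukey_hsd(*logged)   # pairwise post-hoc comparisons
    return {
        "kruskal_p": float(kruskal.pvalue),
        "anova_p": float(anova.pvalue),
        "tukey_p": tukey.pvalue,       # (layers, layers) matrix of p-values
    }


# Synthetic placeholder data for three layers, NOT values from the paper.
rng = np.random.default_rng(0)
placeholder = [rng.lognormal(mean=mu, sigma=0.3, size=50) for mu in (0.0, 0.2, 1.0)]
result = layer_tests(placeholder)
```

Because the Kruskal--Wallis test is rank-based, the log transform does not change its outcome; the transform matters for ANOVA and Tukey's test, which assume approximately normal groups.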
Our code is publicly available at: https://github.com/angelinatsai04/mitll_clinic/tree/adam_spring.
Submission Number: 33