# Shape of Adversarial Influence: Characterizing LLM Latent Space with Persistent Homology

This repository contains the code to reproduce the results shown in our ICLR 2025 paper. The analysis aims to highlight intrinsic differences in the topology of normal versus adversarial activations, facilitating interpretations of the underlying causes of these distinctions. We implement the pipeline described in the paper using persistent homology (PH) to analyze activation patterns.

We use Ripser++ to compute barcodes based on Vietoris-Rips filtrations. Because the computational cost of PH makes it infeasible to compute the barcode of the entire dataset, we rely on subsampling. The barcodes are vectorized into 41-dimensional summaries, followed by cross-correlation analysis to remove highly correlated variables. We apply PCA and canonical correlation analysis (CCA) to investigate feature importance, and train logistic regression models, using Shapley values to assess predictive power.
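The subsampling idea can be illustrated with a minimal sketch. It does not call Ripser++; it only computes the 0-dimensional Vietoris-Rips barcode, whose finite death times coincide with the edge lengths of a minimum spanning tree (using the diameter convention for the filtration parameter). The function names, the subsample size, and the number of subsamples are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

def h0_deaths(points):
    """Finite H0 death times of a Vietoris-Rips filtration, via Prim's MST."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()  # cheapest known edge from the tree to each point
    deaths = []
    for _ in range(n - 1):
        candidates = np.where(in_tree, np.inf, best)
        j = int(np.argmin(candidates))
        deaths.append(candidates[j])  # a component merges at this scale
        in_tree[j] = True
        best = np.minimum(best, dist[j])
    return np.sort(np.array(deaths))

def subsample_barcodes(cloud, num_subsamples, subsample_size, seed=0):
    """Draw random subsamples (hypothetical helper) and compute H0 deaths for each."""
    rng = np.random.default_rng(seed)
    barcodes = []
    for _ in range(num_subsamples):
        idx = rng.choice(len(cloud), size=subsample_size, replace=False)
        barcodes.append(h0_deaths(cloud[idx]))
    return barcodes

# Toy point cloud standing in for one layer of activations.
cloud = np.random.default_rng(1).normal(size=(200, 16))
barcodes = subsample_barcodes(cloud, num_subsamples=5, subsample_size=50)
line_deaths = h0_deaths(np.array([[0.0], [1.0], [3.0]]))
```

Each subsample of `k` points yields `k - 1` finite 0-bars, all born at 0. Higher-dimensional bars (such as the 1-bars used in this repository) require a full PH library.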

Additionally, we study information flow between consecutive and non-consecutive layers by analyzing neuron-level activations. Instead of relying on aggregated representations, our interpretability strategy maps the weights of individual neurons in consecutive layers into a 2D coordinate space. We apply PH to these neuron-specific activations, uncovering structural patterns that provide insights into network behavior. We also test control conditions to validate whether neuron activations capture meaningful signals by examining topological distortions under random shuffling of neuron indices.
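One possible reading of this construction, sketched below, is that for a single input each neuron index contributes one 2D point whose coordinates are that neuron's values in the two layers being compared, and that the control permutes neuron indices in one of the layers. This is an assumption made for illustration; the function names are hypothetical and the exact construction in the paper may differ.

```python
import numpy as np

def neuron_point_cloud(sample, layer_a, layer_b):
    """sample: (num_layers, hidden_dim) -> point cloud of shape (hidden_dim, 2)."""
    return np.stack([sample[layer_a], sample[layer_b]], axis=1)

def shuffled_control(sample, layer_a, layer_b, seed=0):
    """Same cloud, but with neuron indices of layer_b randomly permuted."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(sample.shape[1])
    return np.stack([sample[layer_a], sample[layer_b][perm]], axis=1)

# Toy sample with 4 layers and 8 neurons per layer.
sample = np.random.default_rng(0).normal(size=(4, 8))
cloud = neuron_point_cloud(sample, 0, 1)
control = shuffled_control(sample, 0, 1)
```

The shuffled cloud keeps the marginal distribution of each coordinate but breaks the neuron-wise pairing, so any topological difference between `cloud` and `control` reflects that pairing.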

## Code Structure and Overview

The analysis is structured into two main sections: **Global Analysis** and **Local Analysis**.

### Global Analysis

The **Global Analysis** section includes the main notebook **global_analysis.ipynb**, which serves as the central file for loading precomputed persistence barcodes, running Principal Component Analysis (PCA) and Logistic Regression (LR), and performing SHAP value analysis. It executes the following pipeline:

1. **Loading barcodes**: Reads barcodes from computed data files for different layers and models.
2. **Feature extraction**: Computes statistical summaries of topological features.
3. **PCA Analysis**: Conducts Principal Component Analysis (PCA) to reduce dimensionality.
4. **CCA Analysis**: Performs Canonical Correlation Analysis (CCA) to investigate dependencies.
5. **Regression Analysis**: Trains a logistic regression model on the extracted features.
6. **SHAP Analysis**: Uses SHAP values to assess the predictive contribution of different features.
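Steps 2 and 3 of the pipeline above, together with the removal of highly correlated features, can be sketched as follows. This is a simplified illustration using NumPy only: the five summary statistics are a small stand-in for the 41-dimensional vectorization, PCA is done with a plain SVD, and the function names, the `(birth, death)` bar format, and the 0.9 threshold are assumptions.

```python
import numpy as np

def barcode_summary(bars):
    """bars: array of (birth, death) pairs -> small vector of summary statistics."""
    bars = np.asarray(bars, dtype=float)
    lifetimes = bars[:, 1] - bars[:, 0]
    return np.array([
        lifetimes.mean(),   # 0: mean lifetime
        lifetimes.std(),    # 1: spread of lifetimes
        lifetimes.max(),    # 2: longest bar
        lifetimes.sum(),    # 3: total persistence
        bars[:, 1].mean(),  # 4: mean death time
    ])

def drop_correlated(features, threshold=0.9):
    """Greedily keep columns whose |correlation| with every kept column is below threshold."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(features.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return features[:, keep], keep

def pca_project(features, n_components=2):
    """Project centered features onto the leading principal directions."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

# Toy barcodes: 40 subsamples with 30 bars each.
rng = np.random.default_rng(0)
toy_barcodes = []
for _ in range(40):
    births = rng.uniform(0, 1, size=30)
    deaths = births + rng.exponential(1.0, size=30)
    toy_barcodes.append(np.column_stack([births, deaths]))

features = np.stack([barcode_summary(b) for b in toy_barcodes])
reduced, kept = drop_correlated(features)
projected = pca_project(reduced)
```

In this toy data every barcode has the same number of bars, so total persistence is a constant multiple of the mean lifetime and is removed by the correlation filter.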

This section also contains two subfolders:
- **variance/dispersion/**: Contains scripts for measuring structural differences in activation spaces:
  - `choosing_k.py`: Determines the optimal neighborhood size for local dispersion calculations.
  - `clean_poisoned_ablation.py`: Runs an ablation to verify the significance of the observed inter-condition (class) dispersion ratios.
  - `clean_poisoned.py`: Computes dispersion ratio differences between clean and poisoned activations.
- **variance/distances/**: Contains `distances.py`, which measures pairwise distances between activations across different layers and conditions.

### Local Analysis

The **Local Analysis** section focuses on consecutive and non-consecutive layer analysis and persistence diagram creation to investigate local structural variations in activations. It contains:
- `compute_persistence_diagrams/`: Code for generating persistence diagrams.
- Notebooks for specific analyses:
  - `1_step_consecutive_layers_analysis.ipynb`
  - `3_step_consecutive_layers_analysis.ipynb`
  - `10_step_consecutive_layers_analysis.ipynb`
  - `Llama3_local_analysis.ipynb`
  - `Mistral_local_analysis.ipynb`
  - `Phi3_local_analysis.ipynb`

### Models Analyzed

Most analyses are performed on the three primary models highlighted in the paper:

- **Mistral-7B**
- **Phi-3 3.8B**
- **Llama-3 8B**

Some experiments also compare additional models, including:

- **Llama-3 70B**
- **Mixtral-8x7B**
- **Phi-3 Medium Instruct (128k)**

## Loading Data

We load the activation data using the following approach.

### Activation Tensor Structure

To analyze different activation data, supply different input files to the main analysis scripts; they loop over models and layers dynamically.

The activation tensor has the following shape:

- **First dimension:** Type of data
  - Index **0**: Instruction only
  - Index **1**: Instruction + Data block
  - In the poisoned activations, the data block includes an injected instruction. The injected instruction is benign in this case, as we avoided training on specific attack examples to improve generalizability.
- **Second dimension:** Number of samples (1000)
- **Third dimension:** Number of layers (32). Use this index to extract a specific layer of activations from the LLM. For example, to retrieve the activations of the 10th layer, use `activations[:, :, 9, :]`.
- **Fourth dimension:** Hidden dimension of each layer (4096).
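The indexing conventions above can be checked on a small stand-in array. The sketch below uses NumPy with reduced sizes so that it runs without the real data; the axis order matches the description, and the same slicing syntax applies to a PyTorch tensor.

```python
import numpy as np

# Toy stand-in with the same axis order but smaller sizes:
# (data type, samples, layers, hidden dimension)
activations = np.zeros((2, 5, 4, 8))

instruction_only = activations[0]       # index 0: instruction only
instruction_plus_data = activations[1]  # index 1: instruction + data block

layer_index = 2  # zero-based, so this is the third layer
layer_slice = activations[:, :, layer_index, :]          # both data types
point_cloud = instruction_plus_data[:, layer_index, :]   # one point per sample
```

Here `point_cloud` has one row per sample and one column per hidden unit, which is the kind of point cloud a PH computation on a single layer would take as input.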

### Data Structure Format (Global Analysis)

**Remark:** This code assumes that you have already computed the barcodes for the global analysis. It takes as input two dictionaries (one for normal and one for adversarial activations) structured as follows:

```python
{
  layer: [
    [[bars_0], [bars_1]],  # First subsample
    [[bars_0], [bars_1]],  # Second subsample
    ...
    [[bars_0], [bars_1]]   # num_subsamples-th subsample
  ]
}
```

Each layer index maps to a list containing `num_subsamples` barcodes. Each barcode is encoded as two lists:
- `bars_0`: The 0-bars.
- `bars_1`: The 1-bars.
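As an illustration of how such a dictionary can be traversed, the sketch below builds a toy instance and computes the mean bar lifetime per subsample and homology dimension. It assumes each bar is stored as a `(birth, death)` pair, which is not stated above; the values and the helper `mean_lifetime` are made up for the example.

```python
# Toy dictionary with one layer and two subsamples (values are made up).
barcodes_by_layer = {
    0: [
        [[(0.0, 0.5), (0.0, 1.2)], [(0.3, 0.9)]],  # first subsample
        [[(0.0, 0.4), (0.0, 0.7)], []],            # second subsample, no 1-bars
    ],
}

def mean_lifetime(bars):
    """Mean of death - birth over finite bars; 0.0 if there are none."""
    finite = [death - birth for birth, death in bars if death != float("inf")]
    return sum(finite) / len(finite) if finite else 0.0

# For each layer: one (H0 mean lifetime, H1 mean lifetime) pair per subsample.
summary = {
    layer: [(mean_lifetime(bars_0), mean_lifetime(bars_1))
            for bars_0, bars_1 in subsamples]
    for layer, subsamples in barcodes_by_layer.items()
}
```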

### Code Example: Loading Activation Data for Barcode Computation

```python
import torch

# Load the activation data (e.g., contains 1000 samples)
clean_activations = torch.load('activations/activations_0.pt')
poisoned_activations = torch.load('activations/activations_1.pt')
# Shape of each {condition}_activations: (2, 1000, 32, 4096)

# Subtract the instruction-only activations (index 0) from the instruction + data
# activations (index 1) along the first dimension to isolate the data block's effect
clean_activations = clean_activations[1] - clean_activations[0]
poisoned_activations = poisoned_activations[1] - poisoned_activations[0]
# Shape after subtraction: (1000, 32, 4096)
```
