# Multi-Layer Sparse Autoencoders (MLSAE)

> [!NOTE]
> This repository accompanies REDACTED.
> See [References](#references) for related work.

## Pretrained MLSAEs

This repository defines two types of model: plain PyTorch
[MLSAE](./mlsae/model/autoencoder.py) modules, which are relatively small, and
PyTorch Lightning [MLSAETransformer](./mlsae/model/lightning.py) modules, which
also include the underlying transformer. Hugging Face collections for both are
listed here:

- REDACTED
- REDACTED

We assume that the `repo_id` of each pretrained MLSAE follows
[this naming convention](./mlsae/utils.py):

- REDACTED
- REDACTED

The Weights & Biases project for the paper is
REDACTED.

## Installation

Install Python dependencies with Poetry:

```bash
poetry env use 3.12
poetry install
```

Alternatively, install Python dependencies with pip:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Install Node.js dependencies in the `app` directory:

```bash
cd app
npm install
```

## Training

Train a single MLSAE:

```bash
python train.py --help
python train.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
```
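As a rough guide to the two hyperparameters above, the following sketch assumes that `--expansion_factor` multiplies the residual stream width to give the number of latents, and that `-k` is the number of largest latents kept per token, as in the TopK autoencoders of Gao et al. [2024]. The function names and the width of 512 for `EleutherAI/pythia-70m-deduped` are illustrative assumptions, not taken from `train.py`.

```python
def n_latents(d_model: int, expansion_factor: int) -> int:
    """Number of latents, assuming it is expansion_factor times the model width."""
    return expansion_factor * d_model


def top_k(pre_activations: list[float], k: int) -> list[float]:
    """Keep the k largest pre-activations and zero out the rest."""
    if not 0 < k <= len(pre_activations):
        raise ValueError("k must be between 1 and the number of latents")
    # Indices of the k largest values; ties go to the lower index.
    order = sorted(range(len(pre_activations)), key=lambda i: -pre_activations[i])
    keep = set(order[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(pre_activations)]


# Assumed residual stream width of 512 for EleutherAI/pythia-70m-deduped.
print(n_latents(d_model=512, expansion_factor=64))
print(top_k([0.1, 2.0, -0.5, 1.5, 0.3], k=2))
```

Under these assumptions, the example command would train an autoencoder with 32768 latents, of which 32 are active for each token.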

## Analysis

Test a single pretrained MLSAE:

> [!WARNING]
> We assume that the test split of `monology/pile-uncopyrighted` is already downloaded
> and stored in `data/test.jsonl.zst`.

```bash
python test.py --help
python test.py --model_name EleutherAI/pythia-70m-deduped --expansion_factor 64 -k 32
```

Compute the distributions of latent activations over layers for a single
pretrained MLSAE
(REDACTED):

```bash
python -m mlsae.analysis.dists --help
python -m mlsae.analysis.dists --repo_id REDACTED --max_tokens 100_000_000
```

Compute the maximally activating examples for each combination of latent and
layer for a single pretrained MLSAE
(REDACTED):

```bash
python -m mlsae.analysis.examples --help
python -m mlsae.analysis.examples --repo_id REDACTED --max_tokens 1_000_000
```
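One common way to collect maximally activating examples in a single pass over the data is to keep a small min-heap per combination of latent and layer. The sketch below illustrates that idea only; the class `TopExamples` and its methods are hypothetical and are not the interface of `mlsae.analysis.examples`.

```python
import heapq
from collections import defaultdict


class TopExamples:
    """Track the n largest activations for each (latent, layer) pair."""

    def __init__(self, n: int) -> None:
        self.n = n
        # Each heap holds (activation, token_index); the smallest is at heap[0].
        self.heaps: dict[tuple[int, int], list[tuple[float, int]]] = defaultdict(list)

    def update(self, latent: int, layer: int, activation: float, token_index: int) -> None:
        heap = self.heaps[(latent, layer)]
        item = (activation, token_index)
        if len(heap) < self.n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Replace the current smallest entry with the new, larger one.
            heapq.heapreplace(heap, item)

    def top(self, latent: int, layer: int) -> list[tuple[float, int]]:
        """Return the stored examples, largest activation first."""
        return sorted(self.heaps.get((latent, layer), []), reverse=True)


tracker = TopExamples(n=2)
for token_index, activation in enumerate([0.2, 0.9, 0.5]):
    tracker.update(latent=7, layer=3, activation=activation, token_index=token_index)
print(tracker.top(latent=7, layer=3))
```

Each update costs O(log n), so memory stays bounded by n entries per pair regardless of how many tokens are processed.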

## Figures

Compute the mean cosine similarities between residual stream activation vectors
at adjacent layers of a single pretrained transformer:

```bash
python figures/resid_cos_sim.py --help
python figures/resid_cos_sim.py --model_name EleutherAI/pythia-70m-deduped
```
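The quantity described above can be sketched in plain Python, assuming the residual stream activations are available as one vector per layer and token. The names `cos_sim` and `mean_adjacent_cos_sim` and the nested-list layout are illustrative assumptions, not the interface of `figures/resid_cos_sim.py`.

```python
import math


def cos_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_adjacent_cos_sim(resid: list[list[list[float]]]) -> list[float]:
    """resid[layer][token] is a vector; returns one mean per adjacent layer pair."""
    means = []
    for lower, upper in zip(resid, resid[1:]):
        # Compare the same token position at consecutive layers.
        sims = [cos_sim(u, v) for u, v in zip(lower, upper)]
        means.append(sum(sims) / len(sims))
    return means


toy = [
    [[1.0, 0.0], [0.0, 1.0]],  # layer 0, two tokens
    [[1.0, 0.0], [0.0, 2.0]],  # layer 1: same directions as layer 0
    [[0.0, 1.0], [0.0, 3.0]],  # layer 2: first token rotated by 90 degrees
]
print(mean_adjacent_cos_sim(toy))
```

For a transformer with L layers this yields L - 1 values, one for each pair of adjacent layers.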

Save heatmaps of the distributions of latent activations over layers for
multiple pretrained MLSAEs:

```bash
python figures/dists_heatmaps.py --help
python figures/dists_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
```

Save a CSV of the mean standard deviations of the distributions of latent
activations over layers for multiple pretrained MLSAEs:

```bash
python figures/dists_layer_std.py --help
python figures/dists_layer_std.py --expansion_factor 32 64 128 -k 16 32 64
```
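One plausible reading of this statistic, given here as an assumption rather than a description of `figures/dists_layer_std.py`, is as follows: treat the activation mass of each latent at each layer as a distribution over layer indices, take the standard deviation of the layer index under that distribution, and average over latents. The function names and the input layout are hypothetical.

```python
import math


def layer_std(mass_per_layer: list[float]) -> float:
    """Std of the layer index, weighting each layer by its share of activation mass.

    Assumes the latent has non-zero total mass (it is not dead).
    """
    total = sum(mass_per_layer)
    probs = [m / total for m in mass_per_layer]
    mean = sum(i * p for i, p in enumerate(probs))
    var = sum(p * (i - mean) ** 2 for i, p in enumerate(probs))
    return math.sqrt(var)


def mean_layer_std(mass: list[list[float]]) -> float:
    """mass[latent][layer] is the activation mass; returns the mean std over latents."""
    stds = [layer_std(row) for row in mass]
    return sum(stds) / len(stds)


toy = [
    [0.0, 4.0, 0.0],  # active at a single layer: std 0
    [1.0, 0.0, 1.0],  # split evenly between layers 0 and 2: std 1
]
print(mean_layer_std(toy))
```

Under this reading, a value near zero means that latents tend to be active at a single layer, while larger values mean that latents are active across several layers.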

Save heatmaps of the maximum latent activations for a given prompt and multiple
pretrained MLSAEs:

```bash
python figures/prompt_heatmaps.py --help
python figures/prompt_heatmaps.py --expansion_factor 32 64 128 -k 16 32 64
```

Save a CSV of the Mean Max Cosine Similarity (MMCS) for multiple pretrained
MLSAEs:

```bash
python figures/mmcs.py --help
python figures/mmcs.py --expansion_factor 32 64 128 -k 16 32 64
```
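As a reminder of the metric, the sketch below computes MMCS in its usual form: for each vector in one dictionary, take the maximum cosine similarity with any vector in a second dictionary, then average over the first dictionary. Which pairs of dictionaries `figures/mmcs.py` compares is not stated here, so the inputs and function names are illustrative assumptions.

```python
import math


def cos_sim(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mmcs(dict_a: list[list[float]], dict_b: list[list[float]]) -> float:
    """Mean over vectors in dict_a of the max cosine similarity with any vector in dict_b."""
    best = [max(cos_sim(a, b) for b in dict_b) for a in dict_a]
    return sum(best) / len(best)


dict_a = [[1.0, 0.0], [0.0, 1.0]]
dict_b = [[2.0, 0.0], [1.0, 1.0]]
print(mmcs(dict_a, dict_b))
```

Note that this form is not symmetric: `mmcs(dict_a, dict_b)` and `mmcs(dict_b, dict_a)` generally differ, so the order of the two dictionaries matters.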

## References

### Code

- <https://github.com/openai/sparse_autoencoder>
- <https://github.com/EleutherAI/sae>
- <https://github.com/ai-safety-foundation/sparse_autoencoder>
- <https://github.com/callummcdougall/sae_vis>

### Papers

- Gao et al. [2024] <https://cdn.openai.com/papers/sparse-autoencoders.pdf>
- Bricken et al. [2023]
  <https://transformer-circuits.pub/2023/monosemantic-features/index.html>
