# Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders

This repository contains the supplementary code for our NeurIPS submission, "Transformer Key-Value Memories Are Nearly as Interpretable as Sparse Autoencoders." It includes the core changes we made to SAEBench so that Feedforward Key-Value (FF-KV) memories can be evaluated.

Our modifications to the original SAEBench fall into two areas:
- Implementation of FF-KVs and Transcoders as custom Sparse Autoencoder (SAE) classes.
- Adaptation of the evaluation code to be compatible with FF-KVs and Transcoders.

## Custom SAE Implementations
All custom SAE classes developed for use within SAEBench are located in the `custom_saes/` directory.
Specifically, we have implemented the following:
- **FF-KVs** (`custom_saes/ff_kv.py`)
- **Normalized FF-KVs** (`custom_saes/ff_kv_normalized.py`)
- **k-Sparse FF-KVs** (`custom_saes/k_sparse_ff_kv.py`)
- **Normalized k-Sparse FF-KVs** (`custom_saes/k_sparse_ff_kv_normalized.py`)
- **Transcoder class for Gemma** (`custom_saes/jumprelu_sae.py`)
- **Transcoder class for GPT2** (`custom_saes/gpt2_transcoder.py`)
- An **additional configuration class** to distinguish standard SAEs from FF-KVs/Transcoders (`custom_saes/custom_sae_config.py`)

The `base_sae` class has been retained for reference purposes.
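
As a rough illustration of the idea behind these classes, the sketch below reads a feed-forward layer through an SAE-style `encode`/`decode` interface. It is not the code in `custom_saes/`: the function names, the ReLU activation, the omission of bias terms, and the magnitude-based top-k rule are assumptions made for illustration, and plain Python lists are used only to keep the example self-contained.

```python
def relu(z):
    """Stand-in activation; the actual model's activation function may differ."""
    return max(0.0, z)


def encode(x, keys, act=relu):
    """Return one feature activation per key vector (a row of the FF input matrix)."""
    return [act(sum(k_i * x_i for k_i, x_i in zip(key, x))) for key in keys]


def keep_top_k(acts, k):
    """Zero all but the k largest-magnitude activations (assumed selection rule)."""
    order = sorted(range(len(acts)), key=lambda i: abs(acts[i]), reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(acts)]


def decode(acts, values):
    """Return the activation-weighted sum of value vectors (one value vector per feature)."""
    d_model = len(values[0])
    return [sum(a * v[j] for a, v in zip(acts, values)) for j in range(d_model)]


# Toy layer with 3 features and a 2-dimensional hidden state (hypothetical numbers).
keys = [[1, 0], [0, 1], [-1, -1]]
values = [[1, 0], [0, 2], [3, 3]]
x = [2.0, 0.5]

acts = encode(x, keys)
print(acts)                                  # [2.0, 0.5, 0.0]
print(decode(acts, values))                  # [2.0, 1.0]
print(decode(keep_top_k(acts, 1), values))   # [2.0, 0.0]
```

In this reading, the feature activations are the layer's own hidden activations, so no additional dictionary has to be trained. The k-sparse variant only changes which activations are passed on to `decode`.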

## Adaptations to Evaluation Procedures
We have also modified each evaluation script so that the correct activations are used when evaluating FF-KVs and Transcoders.
The following scripts were adapted; every modification is marked with the comment `# MODIFIED`:
- **Core Evaluation**: Responsible for the "Feature Alive" and "Explained Variance" metrics (`evals/core/main.py`).
- **Absorption** (`evals/absorption/common.py` and `sae_bench_utils/activation_collection.py`).
- **Sparse Probing** (`evals/sparse_probing/main.py`).
- **Autointerp** (`evals/autointerp/main.py`).
- **SCR and TPP** (`evals/scr_and_tpp/main.py`).
- **RAVEL** (`sae_bench_utils/activation_collection.py`).

All original evaluation code has been retained for completeness and reference. Default parameters were used for all evaluations; these can be found in `eval_config.py` within the respective directory for each evaluation script.
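
To review the changes quickly, the `# MODIFIED` markers can be listed with a short script such as the one below. This is a convenience sketch rather than part of the repository: the function name `find_marked_lines` is hypothetical, and it assumes the script is run from the repository root so that the `evals` and `sae_bench_utils` directories resolve.

```python
from pathlib import Path

MARKER = "# MODIFIED"


def find_marked_lines(root, marker=MARKER):
    """Return (path, line number, stripped line) for every .py line containing the marker."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    # Directories named in this README; a missing directory simply yields no hits.
    for directory in ("evals", "sae_bench_utils"):
        for path, lineno, line in find_marked_lines(directory):
            print(f"{path}:{lineno}: {line}")
```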