# Transcoders Find Interpretable LLM Feature Circuits

This repository contains the code for our NeurIPS 2024 submission "Transcoders Find Interpretable LLM Feature Circuits".

To get started, we recommend working through the `walkthrough.ipynb` notebook. The full structure of the repository is as follows:

* `walkthrough.ipynb`: A walkthrough notebook that demonstrates how to use the tools provided in this repository for reverse-engineering LLM circuits with transcoders.
* `requirements.txt`: The standard Python dependencies list.

* `case_study_citations.ipynb`: A reverse-engineering case study in which we investigated a transcoder feature that activates on semicolons in parenthetical citations.
* `case_study_caught.ipynb`: A reverse-engineering case study in which we investigated a transcoder feature that activates on the verb "caught".
* `case_study_local_context.ipynb`: A case study in which we attempted to reverse-engineer a circuit that computes a harder-to-interpret transcoder feature. (We were less successful here, but we include it in the interest of transparency.)
* `restricted blind case studies.ipynb`: A notebook containing a set of "restricted blind case studies" that reverse-engineer random GPT2-small transcoder features (as referenced in the paper).

* `sae_training/`: Code for training and using transcoders, forked from an older version of [Joseph Bloom's excellent SAE repository](https://github.com/jbloomAus/SAELens). (The misnomer `sae_training` is a vestige of this origin of the code.)
* `transcoder_circuits/`: Code for reverse-engineering and analyzing circuits with transcoders. These are the tools that we use in the walkthrough notebook and in the case studies.
* `train_pythia_transcoder.py`: An example script for training a transcoder on Pythia.
* `train_pythia_sae.py`: An example script for training an SAE on Pythia.

* `interp-comparison.ipynb`: Code for the SAE-vs.-transcoder interpretability comparison carried out in the paper.
* `feature dashboards/`: The "feature dashboards" that were used in the SAE-vs.-transcoder interpretability comparison.
* `sweep.ipynb`: Code demonstrating how to evaluate transcoders' and SAEs' L0 and faithfulness metrics, as done in the paper.
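
As a rough illustration of the L0 metric mentioned for `sweep.ipynb`: L0 is the average number of features that are active (nonzero) per token. The sketch below computes it in plain Python on a made-up activation matrix. The function name `mean_l0` and the toy data are hypothetical and are not part of this repository's code; the notebook itself is the reference for how the paper's numbers were produced.

```python
def mean_l0(feature_acts):
    """Average number of nonzero feature activations per token.

    feature_acts: a list of rows, one per token, each a list of
    feature activation values (hypothetical input format).
    """
    # Count the active (nonzero) features for each token.
    counts = [sum(1 for a in row if a != 0) for row in feature_acts]
    # Average the per-token counts.
    return sum(counts) / len(counts)


# Toy example: 3 tokens, 4 features.
acts = [
    [0.0, 1.2, 0.0, 0.3],  # 2 active features
    [0.0, 0.0, 0.0, 0.0],  # 0 active features
    [0.5, 0.0, 0.0, 0.0],  # 1 active feature
]
print(mean_l0(acts))  # 1.0
```

A lower L0 means sparser feature activations. In practice one would likely compute this over batched tensors rather than nested lists.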