# 🌉 SAE4Steer: From Interpretability to Utility
**Paper:** _Does higher interpretability imply better utility? A Pairwise Analysis on Sparse Autoencoders_  
**Core Goal:** **Build a bridge** between the **interpretability** and **utility** of Sparse Autoencoders (SAEs).

---

## 1. Overview
This repository orchestrates a complete pipeline to:
1) **Train** Sparse Autoencoders (SAEs) via dictionary learning,  
2) **Evaluate** interpretability with **SAEBench**,  
3) **Measure utility** with **steering** (AxBench-style scoring),  
4) **Analyze** the **pairwise relation** between interpretability scores and steering performance, and  
5) **Select features** with **Δ Token Confidence** and **Output Score**.

The pipeline is designed to be **modular**, **reproducible**, and **GPU-friendly**, using only **relative paths** throughout.

---

## 2. Environment & Prerequisites
> 💡 Use **relative paths**; **do not** hardcode machine-specific directories.

- **Python** & CUDA toolchain compatible with your target LLMs and frameworks.
- **GPU**: A100/A800-class recommended for larger models; adjust batch sizes as needed.
- **Dependencies**: Install each external repository per its own README.
- **OpenAI (optional, for judging)**: If you use `--judge_backend openai_async`, ensure API credentials are correctly exported (e.g., `OPENAI_API_KEY`).
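
Before launching a judged run, a quick pre-flight check can confirm that the key is visible to the process. The sketch below is only an illustration: `require_env` is a hypothetical helper, not part of any of the referenced repositories.

```python
import os


def require_env(name, env=None):
    """Return the stripped value of environment variable `name`.

    Hypothetical helper for illustration; raises RuntimeError when the
    variable is missing or empty so the run fails before any GPU work starts.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; export it before using --judge_backend openai_async"
        )
    return value


if __name__ == "__main__":
    try:
        require_env("OPENAI_API_KEY")
        print("OPENAI_API_KEY found")
    except RuntimeError as err:
        print(err)
```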

---

## 3. Train SAEs (Dictionary Learning)
We use **[dictionary_learning](https://github.com/saprmarks/dictionary_learning)** for SAE training.

### 3.1 Configure
Edit the following files:

- **`sae4steer/dictionary_learning_demo/demo_config.py`**  
  Set `LLM_CONFIG` and other parameters to match your base model & layers.  
  _Example (illustrative placeholders):_
  ```python
  # sae4steer/dictionary_learning_demo/demo_config.py
  LLM_CONFIG = {
      "hf_model_name": "google/gemma-2-2b",
      "device": "cuda",
      "dtype": "bfloat16",
      "context_length": 2048,
      "activation": "resid_post",
      "layers": [12],  # list or range of target layers
  }

  SAE_CONFIG = {
      "arch": "top_k_jump_relu",
      "k": 330,
      "hidden_size": 16384,
      "sparsity_target": 0.03,
      "lr": 1e-3,
      "sae_batch_size": 2048,
  }

  TRAINING_CONFIG = {
      "llm_batch_size": 4,
      "num_workers": 4,
      "save_dir": "./SAEBench/sae_bench/custom_saes/downloaded_saes/",
      "logging_every": 100,
  }
  ```

- **`sae4steer/dictionary_learning_demo/demo.py`**
- **`sae4steer/dictionary_learning_demo/parallel_training.py`**  
  Ensure that paths, the layer list, and any launcher settings are consistent with `demo_config.py`.

### 3.2 Launch Training
From the demo directory:
`cd ./sae4steer/dictionary_learning_demo`

`tmux new -s sae_train`
`python -u parallel_training.py`

### 3.3 Resource Footprint (Reference)
⏱️ These are example end-to-end references; tune for your hardware.

| Base model  | SAE count |    Disk |  Wall time | `llm_batch_size` | `context_length` | `sae_batch_size` | Memory | GPU    |
| ----------- | ----: | ------: | ---------: | ---------------: | --------------: | ---------------: | -----: | ------ |
| Gemma2-2b   |    30 |  8.7 GB |    16h 30m |                4 |            2048 |             2048 |  20 GB | 2×A800 |
| Gemma2-9b   |    30 | 13.2 GB | 2d 12h 35m |                4 |            2048 |             2048 |  70 GB | 2×A800 |
| Qwen2.5-3b  |    30 |  7.7 GB |     1d 13h |                4 |            2048 |             2048 |  30 GB | 2×A800 |


## 4. Evaluate with SAEBench (Interpretability)
We use SAEBench and focus on interpretability metrics.

1) Start a new session (optional):  
   `tmux new -s saebench`
2) Go to the custom SAE runner directory:  
   `cd ./sae4steer/SAEBench/sae_bench/custom_saes`
3) Set `PYTHONPATH` for this session:  
   `export PYTHONPATH="$(pwd)/../../..:$PYTHONPATH"`
4) Create a logs folder (optional but recommended):  
   `mkdir -p ./eval_results/logs`
5) Run the evaluations and tee the output to a log file:  
   `CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0 python run_all_evals_dictionary_learning_saes.py 2>&1 | tee -a ./eval_results/logs/$(date +%F_%H-%M-%S)_dict_saes.log`
  
🔎 In `run_all_evals_dictionary_learning_saes.py`, select which interpretability metrics to run for your trained SAEs. Our work prioritizes interpretability.

## 5. Compute Steering Utility (AxBench + Steering Pipeline)
We compute SAE steering utility using AxBench-style scoring and the saes-are-good-for-steering pipeline.

### 5.1 Link SAEs by Sparsity
After SAEBench (or after saving trained SAEs), link by sparsity:

`cd ./sae4steer/SAEBench/sae_bench/custom_saes`

`python link_sae_by_sparsity.py ./downloaded_saes/trained_saes_2__google_gemma-2-2b_top_k_jump_relu/resid_post_layer_12`

  
### 5.2 Prepare Instructions (Alpaca)
In AxBench:

`cd ./sae4steer/axbench/axbench`

Download the Alpaca eval set to `./data/alpaca_eval.json`:

`bash ./data/download-alpaca.sh`

Generate the concept CSVs by opening and running the Jupyter notebook `./scripts/concepts_generate.ipynb`. This produces concept CSV files such as:

`./concept10_gemma2_2b_data/batch_topk_80_0.8357.csv`

### 5.3 Turn Concepts → JSON
In saes-are-good-for-steering:

`cd ./sae4steer/saes-are-good-for-steering`

Set up the environment (bootstrap):

`bash concept_load.sh`

Convert the concept CSV to the JSON spec used by steering:

`python concept_load.py --csv ./../axbench/axbench/concept10_gemma2_2b_data/batch_topk_80_0.8357.csv --out_dir ./concept --out_name concept_descriptions.json`
  
### 5.4 Convert Concepts → Features

Run the conversion script, which produces `./data/features/gemma2_2b_features.json` (example):

`python convert_concepts_to_features.py`

If needed, configure the script to point at your model/layer mapping and concept JSON.

### 5.5 Run Steering
🧪 You can use the shell launcher or invoke Python directly.

Single/Multi-GPU launchers (examples):
`bash launch_2_workers.sh`
or
```bash
python run_sae_steering.py \
  --model_type gemma2_2b_it \
  --dl_local_dir ./../SAEBench/sae_bench/custom_saes/downloaded_saes/\
trained_saes_2__google_gemma-2-2b_top_k_jump_relu/resid_post_layer_12/jumprelu_330 \
  --features_file ./data/features/gemma2_2b_features.json \
  --instructions_file ./../axbench/axbench/data/alpaca_eval.json \
  --concepts_file ./concept/concept_descriptions.json \
  --judge_backend openai_async \
  --judge_model gpt-4o-mini \
  --layers 12 \
  --steering_factors 0.2,0.4,0.8,1.5,2.0,3.0 \
  --dev_k 5 \
  --max_new_tokens 128 \
  --save_dir ./runs/debug_try \
  --debug --sample_print_k 1 --print_chars 300
```
🔐 If using openai_async judging, configure API credentials in your environment before running.
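
As a rough illustration of AxBench-style scoring, the sketch below assumes that the judge returns three ratings per generation (concept, instruction following, fluency), each on a 0 to 2 scale, and combines them with a harmonic mean, so a zero on any axis zeroes the overall score. It then picks the steering factor with the best mean score. The function names and the example ratings are hypothetical; consult the AxBench repository for the exact rubric and aggregation.

```python
from statistics import mean


def harmonic_mean_score(concept, instruct, fluency):
    """Harmonic mean of three judge ratings; 0.0 if any rating is 0 or below."""
    scores = (concept, instruct, fluency)
    if any(s <= 0 for s in scores):
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)


def best_factor(ratings_by_factor):
    """Pick the steering factor with the highest mean harmonic score.

    ratings_by_factor maps a steering factor to a list of
    (concept, instruct, fluency) tuples, one per judged generation.
    """
    averaged = {
        factor: mean(harmonic_mean_score(*r) for r in ratings)
        for factor, ratings in ratings_by_factor.items()
    }
    best = max(averaged, key=averaged.get)
    return best, averaged[best]


if __name__ == "__main__":
    # Made-up ratings for three of the steering factors used above.
    toy_ratings = {
        0.4: [(1, 2, 2), (0, 2, 2)],
        1.5: [(2, 2, 2), (2, 1, 2)],
        3.0: [(2, 0, 1), (2, 0, 0)],
    }
    print(best_factor(toy_ratings))
```

The harmonic mean penalizes outputs that express the concept strongly but stop following the instruction or become disfluent, which is the typical failure mode at large steering factors.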

## 6. Pairwise Analysis: Interpretability ↔ Utility
Open the notebook:

`./sae4steer/saes-are-good-for-steering/src/exp_results/interp_steering_pairwise.ipynb`

Set the paths to your interpretability scores (from SAEBench, Section 4) and steering scores (from Section 5).

The notebook performs a pairwise analysis to test:
Does higher interpretability imply better utility?
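
One way to frame the pairwise question, sketched here under assumptions rather than taken from the notebook: for every pair of SAEs, check whether the one with the higher interpretability score also has the higher steering score. The counts of concordant and discordant pairs, and the Kendall tau-a statistic derived from them, summarize the answer. The function name and the toy scores are hypothetical.

```python
from itertools import combinations


def pairwise_agreement(interp, steer):
    """Count concordant/discordant SAE pairs and compute Kendall's tau-a.

    interp and steer are equal-length sequences; index i holds the
    interpretability score and steering score of the same SAE.
    """
    if len(interp) != len(steer):
        raise ValueError("score lists must have the same length")
    concordant = discordant = ties = 0
    for (x1, y1), (x2, y2) in combinations(zip(interp, steer), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both scores order the pair the same way
        elif sign < 0:
            discordant += 1   # the two scores disagree on the ordering
        else:
            ties += 1         # at least one score is tied
    total = concordant + discordant + ties
    tau_a = (concordant - discordant) / total if total else 0.0
    return {
        "concordant": concordant,
        "discordant": discordant,
        "ties": ties,
        "tau_a": tau_a,
    }


if __name__ == "__main__":
    # Made-up scores for four SAEs.
    interp_scores = [0.62, 0.70, 0.81, 0.90]
    steer_scores = [0.40, 0.55, 0.35, 0.50]
    print(pairwise_agreement(interp_scores, steer_scores))
```

A tau near 1 would mean that more interpretable SAEs almost always steer better, while a tau near 0 would mean that interpretability rankings say little about steering utility.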

## 7. Feature Selection via Δ Token Confidence & Output Score
We provide scripts to select features with higher steering utility via Δ Token Confidence, and to compare this criterion with Output Score (following saes-are-good-for-steering).

Run one of the provided examples (choose per model), e.g. Gemma2-2b:


`cd ./sae4steer/saes-are-good-for-steering`

`bash ./src/outputscore_entropy_confidence_all_gemma2_2b.sh`

⭐ Core code (for customization):
`./sae4steer/saes-are-good-for-steering/src/output_score_with_entropy_confidence.py`

Configure your model, layers, and SAE paths inside the bash script.

Results are stored in:
`./sae4steer/saes-are-good-for-steering/cache/results_entropy_score/`

Each SAE's features receive both a Δ Token Confidence and an Output Score.
Use these scores for SAE feature selection; see:
`./sae4steer/saes-are-good-for-steering/src/exp_results/token_conf_steering_score.ipynb`
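
This README does not define Δ Token Confidence. The sketch below assumes, purely for illustration, that it is the change in the model's top next-token probability when a feature is steered, and then keeps the `k` features with the highest score. All function names are hypothetical; the actual definition lives in `output_score_with_entropy_confidence.py` and may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_confidence(logits):
    """Probability assigned to the most likely next token."""
    return max(softmax(logits))


def delta_token_confidence(base_logits, steered_logits):
    """Assumed definition: steered confidence minus unsteered confidence."""
    return token_confidence(steered_logits) - token_confidence(base_logits)


def top_k_features(scores, k):
    """Return the k feature ids with the highest score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Made-up logits over a 3-token vocabulary for two features.
    base = [0.0, 0.0, 0.0]
    scores = {
        101: delta_token_confidence(base, [5.0, 0.0, 0.0]),
        202: delta_token_confidence(base, [0.5, 0.0, 0.0]),
    }
    print(top_k_features(scores, k=1))
```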

## 8. Tips & Reproducibility
- ✅ **Relative paths only:** Keep everything under the project root (as shown).
- 📦 **Checkpoints & artifacts:** Store trained SAEs under `./SAEBench/sae_bench/custom_saes/downloaded_saes/`.
- 🔁 **Logging:** Use `tee` to persist logs; pass `--save_dir` for steering runs.
- 📈 **Layers & features:** Keep layer indices consistent across training → evaluation → steering.
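
The relative-path rule can be checked mechanically before a run. The sketch below is a hypothetical helper, not part of the repository: it flags any configured path that is absolute or starts with `~`, treating paths as POSIX-style strings.

```python
from pathlib import PurePosixPath


def machine_specific_paths(paths):
    """Return the paths that are absolute or home-relative (start with '~')."""
    flagged = []
    for p in paths:
        if PurePosixPath(p).is_absolute() or p.startswith("~"):
            flagged.append(p)
    return flagged


if __name__ == "__main__":
    configured = [
        "./SAEBench/sae_bench/custom_saes/downloaded_saes/",
        "./runs/debug_try",
        "/home/user/saes",  # made-up example of a path to avoid
    ]
    for bad in machine_specific_paths(configured):
        print(f"not a relative path: {bad}")
```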



## 9. Acknowledgements 🧩
We thank the following repositories for their excellent work and codebases:

- SAE Training (Dictionary Learning): https://github.com/saprmarks/dictionary_learning
- SAEBench: https://github.com/adamkarvonen/SAEBench
- Steering Score (AxBench): https://github.com/stanfordnlp/axbench
- Output Score: https://github.com/technion-cs-nlp/saes-are-good-for-steering
