# Activation Reasoning
> **Disclaimer:** This is research code; we will clean it up for the final camera-ready release.

## Highlights
- **End-to-end activation reasoning pipeline** that searches for SAE features, detects concept activations, reasons over rules, and steers hidden states before token generation.
- **Configurable concept search** (word-, position-, or sentence-level) with greedy top-k or decision-tree strategies.
- **Rule-based steering** with per-rule scaling, alternative weighting functions, and multiple norm choices.
- **Evaluation suites** covering ProverQA, ProntoQA, BeaverTails, Golden Gate, and the R2C trains benchmark.
- **Dataset tooling** under `experiments/` and `ar/utils.py` for generating synthetic reasoning problems and color-coded train prompts.

## Repository Layout
- `ar/` – core Python package (configuration objects, concept search/detection, `ActivationReasoning` model, logical parser, utilities).
- `experiments/` – runnable scripts and notebooks for ProverQA, ProntoQA, trains, BeaverTails, Golden Gate, and dataset utilities.
- `data/` – benchmark assets (ProntoQA, ProverQA, R2C trains, Eiffel Tower & Golden Gate corpora, OpenAI moderation samples, etc.).
- `safety/` – hooks for loading external safety corpora (expects the upstream `r2guard` project in `safety/r2guard`).
- `sparsify/` – vendorized helpers for loading EleutherAI sparse autoencoders.

## Requirements
- Python 3.10
- CUDA-capable GPU (default configs load 8B-class models like `meta-llama/Meta-Llama-3.1-8B`; budget ~24 GB VRAM or swap to a smaller model).
- Hugging Face access token via `export HF_TOKEN=...` for downloading base models and SAE checkpoints.

The provided `requirements.txt` is intentionally exhaustive: it covers notebook tooling, plotting stacks, Hugging Face ecosystem libraries, SAE handling, and safety evaluation dependencies. Install selectively if you only need a subset of experiments.

## Installation
```bash
python3.10 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
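
Before launching an experiment, a quick preflight check can catch a missing token early. The sketch below is not part of this repository: `preflight` is a hypothetical helper that only inspects the environment variables and the interpreter version listed under Requirements.

```python
import os
import sys


def preflight(env=None):
    """Return a list of human-readable problems; an empty list means ready.

    Hypothetical helper, not shipped with this repository.
    """
    env = os.environ if env is None else env
    problems = []
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set; gated model downloads may fail.")
    if sys.version_info[:2] != (3, 10):
        found = ".".join(str(part) for part in sys.version_info[:2])
        problems.append(f"Requirements list Python 3.10, found {found}.")
    return problems


for problem in preflight():
    print("WARNING:", problem)
```

An empty result only means the two checks passed; it does not verify GPU memory or that the token has access to the chosen model.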


## Quick Start (Golden Gate Example)
The snippet below runs concept search on a small background corpus and then detects concepts in a real Golden Gate Bridge paragraph. It assumes `data/eiffel_tower_wikipedia.txt` exists (bundled in this repository).
For a runnable version, see `golden_gate.ipynb`.

```python
from pathlib import Path
from ar import ActivationReasoning, LogicConfig

# Map concept tuples to the landmark they identify.
rules = {
    ("Bridge", "San Francisco", "United States"): "Golden Gate Bridge",
    ("Tower", "Paris", "France"): "Eiffel Tower",
}
concepts = sorted({concept for lhs in rules for concept in lhs})

# Build a corpus with real sentences so concept search has signal.
background_corpus = [
    sentence
    for sentence in Path("data/eiffel_tower_wikipedia.txt").read_text(encoding="utf-8").split(". ")
    if sentence.strip()  # drop empty or whitespace-only fragments
][:20]
background_corpus.append(
    "Sydney is famous for the Opera House that overlooks the harbour in Australia."
)

model = ActivationReasoning(
    rules=rules,
    concepts=concepts,
    config=LogicConfig(search_concept_type="word", search_top_k_order="original_order"),
    cache_dir="output/cache/landmarks_demo",
    model_name="meta-llama/Meta-Llama-3.1-8B",
    sae_name="EleutherAI/sae-llama-3.1-8b-64x",
    hookpoint="layers.23",
    verbose=False,
)

# 1) Collect SAE activations for the landmark concepts.
model.search(inputs=background_corpus, reset_cache=True, batch_size=8)

# 2) Detect concepts in an unseen Golden Gate Bridge paragraph.
paragraph = (
    "The Golden Gate Bridge, an iconic suspension bridge in San Francisco, United States, "
    "spans approximately 1.7 miles across the Golden Gate Strait, the entrance to San Francisco "
    "Bay from the Pacific Ocean. Completed in 1937, it was the longest and tallest suspension "
    "bridge in the world at the time."
)
metadata = model.detect([paragraph], verbose=False)
print(metadata[0]["concepts"])  # ['Bridge', 'San Francisco', 'United States']
print(metadata[0]["rules"][0])  # [('Bridge', 'San Francisco', 'United States') -> 'Golden Gate Bridge']
```
The detector surfaces the active concepts and the reasoner confirms that `(Bridge, San Francisco, United States)` triggers the `Golden Gate Bridge` rule. The same metadata powers steering during generation or downstream analyses.
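
As a rough mental model of the rule check, a rule fires when every concept on its left-hand side has been detected. The sketch below illustrates that subset-matching assumption in plain Python. `fired_rules` is a hypothetical helper, not the repository's reasoner, whose checking modes may behave differently.

```python
def fired_rules(detected, rules):
    """Return the conclusions of all rules whose premise concepts were detected.

    Assumption: a rule fires iff its premise tuple is a subset of `detected`.
    """
    detected = set(detected)
    return [head for body, head in rules.items() if set(body) <= detected]


rules = {
    ("Bridge", "San Francisco", "United States"): "Golden Gate Bridge",
    ("Tower", "Paris", "France"): "Eiffel Tower",
}
print(fired_rules(["Bridge", "San Francisco", "United States"], rules))
# ['Golden Gate Bridge']
```

Extra detected concepts do not block a rule under this assumption, and a partially matched premise (for example, only `"Tower"` and `"Paris"`) fires nothing.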

### Key Concepts
- **Search** (`ActivationReasoning.search`) builds SAE caches for each concept under `cache_dir`.
- **Detection** (`ActivationReasoning.detect`) monitors the chosen hookpoint layer during generation and flags concept activations.
- **Reasoner** (`ALReasoner`) reconciles detected concepts with the provided rules (`reasoner_rules_checking` supports `legacy`, `simple`, `complex`, `open_world`).
- **Steering** adjusts hidden activations when rules fire. Disable by setting `steering_factor=0` or by using `LogicConfigNoSteering.DEFAULT`.
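
To make the role of `steering_factor` concrete, the sketch below shows one common form of activation steering: adding a scaled, unit-normalised direction to a hidden state. This is an assumption for illustration only. The `steer` function and the idea of a single concept `direction` are hypothetical, and the repository's steering (with its weighting functions and norm choices) may differ.

```python
import math


def steer(hidden, direction, steering_factor):
    """Return hidden + steering_factor * direction / ||direction||.

    Illustrative only; vectors are plain lists of floats.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if steering_factor == 0 or norm == 0:
        return list(hidden)  # a factor of 0 leaves the hidden state unchanged
    return [h + steering_factor * d / norm for h, d in zip(hidden, direction)]


hidden_state = [0.5, -1.0, 2.0]
concept_direction = [0.0, 3.0, 4.0]
print(steer(hidden_state, concept_direction, steering_factor=0))    # unchanged copy
print(steer(hidden_state, concept_direction, steering_factor=2.0))  # pushed along the direction
```

Under this reading, setting the factor to zero is a no-op, which matches the way steering is disabled above.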

### LogicConfig Overview
Important parameters (see `ar/config.py` for full documentation):
- `search_concept_type`: granularity for concept extraction (`"word"`, `"sentence"`, `"position"`).
- `search_strategy`: greedy `"top_k"` vs. decision-tree `"tree"` searches.
- `search_top_k`, `detection_top_k_concepts`, `detection_top_k_output`: control how many SAE latents are cached and monitored.
- `steering_factor`, `steering_top_k_rule`, `steering_weighting_function`, `steering_norm`, `steering_methodology`: tune the steering behaviour.
- `reasoner_rules_checking`: choose the reasoning approach for matching concept sets to rules.
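
The `top_k` parameters above all bound how many SAE latents are kept. As a minimal sketch of greedy top-k selection, assuming each latent has a single relevance score for a concept (the scores below are made up, and how the repository actually scores latents is not described here):

```python
import heapq


def top_k_latents(scores, k):
    """Return the indices of the k highest-scoring latents, best first.

    `scores` maps latent index -> score; hypothetical helper for illustration.
    """
    return heapq.nlargest(k, scores, key=scores.get)


latent_scores = {10: 0.2, 42: 0.9, 7: 0.5}  # made-up values
print(top_k_latents(latent_scores, k=2))
# [42, 7]
```

A larger `k` keeps more latents per concept at the cost of more cached and monitored features.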

## Experiment Workflows
### R2C (color-coded trains)
`experiments/countries.py` and `experiments/generation_timing_r2c.py` benchmark Activation Reasoning versus baseline prompting:
```bash
CUDA_VISIBLE_DEVICES=0 python experiments/generation_timing_r2c.py --dataset_type explicit --num_samples 200
```
The script loads LLaMA 3.1 8B, performs concept search on the training split, times generation on the test prompts, and writes CSV summaries.

### ProverQA
`experiments/ProverQA_eval.py` evaluates Activation Reasoning on ProverQA variants:
```bash
CUDA_VISIBLE_DEVICES=0 python experiments/ProverQA_eval.py \
  --model_name llama_31_base \
  --steering \
  --difficulties easy medium hard
```
Each run writes to `output/experiments/proverqa/`, where it caches SAE activations and logs JSON artefacts containing ground truth, detected concepts, solver outputs, and generation accuracy. Preset configs cover Gemma 2 9B, Gemma 2 9B-IT, LLaMA 3.1 8B, and LLaMA 3.1 8B-Instruct.

### ProntoQA & Notebooks
- `experiments/ProntoQA_logic.py` parses natural-language rules into symbolic representations usable by Activation Reasoning.
- `experiments/prontoQA_eval.ipynb`, `experiments/ProverQA.ipynb`, and companion notebooks reproduce project figures; they expect cached SAE activations in `output/` and datasets in `data/`.
- `experiments/trains_eval.ipynb`, `experiments/BeaverTails.ipynb`, `golden_gate.ipynb`, etc., explore additional corpora.

## Data Assets
Notable files in `data/`:
- `r2c/{Mono,Meta}/{train,eval,test}.csv` (color-coded trains prompts)
- Eiffel Tower and Golden Gate corpora plus other reasoning datasets used in notebooks

Most scripts write caches, plots, and logs to `output/` (or the `cache_dir` you supply). Create the directory beforehand or pass a custom path when instantiating `ActivationReasoning`.
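
Creating the directory up front takes one call with the standard library. `ensure_cache_dir` below is a hypothetical convenience wrapper, not a function from this repository.

```python
from pathlib import Path


def ensure_cache_dir(path):
    """Create `path` (and any missing parents) and return it as a Path."""
    cache_dir = Path(path)
    cache_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return cache_dir
```

For example, `ensure_cache_dir("output/cache/landmarks_demo")` prepares the directory used in the Quick Start, and the resulting path can be passed as `cache_dir=str(...)`.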

## Tips & Troubleshooting
- **Model downloads:** set `HF_TOKEN`; optionally configure `HF_HOME` so weights persist across runs.
- **VRAM pressure:** cap `max_new_tokens`, limit concept sets, choose a smaller model, or rely on the default `torch.float16` execution.
- **Cache resets:** call `search(..., reset_cache=True)` whenever concept definitions or SAE checkpoints change.
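
One way to avoid forgetting a cache reset is to fingerprint the inputs that invalidate it. The sketch below is an assumption-laden illustration: `cache_fingerprint`, `needs_reset`, and the `fingerprint.txt` marker file are hypothetical and not part of this repository.

```python
import hashlib
import json
from pathlib import Path


def cache_fingerprint(concepts, sae_name):
    """Hash the concept list and SAE checkpoint name (order-insensitive)."""
    payload = json.dumps(
        {"concepts": sorted(concepts), "sae_name": sae_name}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_reset(marker_dir, fingerprint):
    """Return True if the stored fingerprint is missing or differs, then update it."""
    marker = Path(marker_dir) / "fingerprint.txt"  # hypothetical marker file
    stale = not marker.exists() or marker.read_text(encoding="utf-8") != fingerprint
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(fingerprint, encoding="utf-8")
    return stale
```

The result could then feed the existing flag, for example `model.search(inputs=background_corpus, reset_cache=needs_reset(marker_dir, fingerprint), batch_size=8)`. Keeping the marker in a directory separate from `cache_dir` avoids interfering with the files the library writes there.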