## Installation

### Requirements

- Python 3.8+
- CUDA-capable GPU (recommended: 24GB+ VRAM for LLM experiments, 16GB+ for diffusion)

### Setup

1. Clone the repository:
```bash
git clone https://github.com/yourusername/midsteer.git
cd midsteer
```

2. Create and activate a virtual environment:
```bash
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

3. Install the dependencies for your platform:
```bash
# For macOS/Darwin
pip install -r requirements/darwin.txt

# For Linux
pip install -r requirements/linux.txt
```

4. Set up Hugging Face authentication (required for downloading models):
```bash
huggingface-cli login
```
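
Before installing, it can help to confirm that the interpreter meets the Python 3.8+ requirement and to pick the matching requirements file. The sketch below is illustrative only: `check_python` and `requirements_file` are hypothetical helpers that are not part of the repository, and the two file paths are the ones listed in step 3.

```python
import platform
import sys

# Requirements files named in step 3; other platforms are not covered here.
REQUIREMENTS = {
    "Darwin": "requirements/darwin.txt",
    "Linux": "requirements/linux.txt",
}


def check_python(version_info=sys.version_info, minimum=(3, 8)):
    """Return True if the interpreter satisfies the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


def requirements_file(system=None):
    """Map a platform name, as returned by platform.system(), to a requirements file."""
    system = system or platform.system()
    try:
        return REQUIREMENTS[system]
    except KeyError:
        raise RuntimeError(f"No requirements file listed for platform {system!r}") from None
```

Calling `requirements_file()` with no argument uses the current platform, so the result can be passed straight to `pip install -r`.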

## Repository Structure

```
midsteer/
├── core/                       # Core library code
│   ├── llm_steering.py        # LLM steering implementation
│   ├── diffusion_steering.py  # Diffusion model steering
│   ├── controller.py          # Vector control logic
│   ├── dataset.py             # Dataset utilities
│   └── eval/                  # Evaluation metrics (CLIP, FID)
├── scripts/
│   ├── llm/                   # LLM experiment scripts
│   │   ├── generate_steering_vectors.py
│   │   ├── estimate_covariances.py
│   │   ├── run_with_steering.py
│   │   ├── concept_scoring.py
│   │   └── consistency_scoring.py
│   └── diffusion/             # Diffusion model scripts
│       ├── estimate_steering_vectors.py
│       ├── estimate_covariances.py
│       ├── run_with_steering.py
│       └── produce_scores.py
├── exp/
│   ├── datasets/              # Training and evaluation datasets
│   │   ├── train/            # Concept-specific questions for LLMs
│   │   └── eval/             # Evaluation templates
│   └── sh/                   # Shell scripts for running experiments
├── notebooks/
│   └── produce_charts.ipynb  # Generate paper figures
└── requirements/             # Platform-specific requirements
```

## Reproducing Paper Results

Our paper presents experiments on two main tasks across multiple models. Here's how to reproduce each result:

### 1. LLM Experiments

#### Models Tested
- Llama-2-7B-chat-hf
- Qwen2.5-7B-Instruct
- Qwen2.5-14B-Instruct

#### Concept Pairs
- horses ↔ motorcycles
- dogs ↔ cats

#### A. Concept Erasure (Section 4.2.1)

This experiment removes unwanted concepts from LLM outputs.

**Quick start (single GPU):**
```bash
# Estimate neutral activation statistics (means and covariances)
.venv/bin/python3 scripts/llm/estimate_covariances.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --layer_type self_attn \
    --token_aggregation_mode all \
    --num_samples 50000 \
    --max_new_tokens 100 \
    --output_dir ./results/llama2-7b/covariances

# Generate one steering vector per concept
.venv/bin/python3 scripts/llm/generate_steering_vectors.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --layer_type self_attn \
    --topics horses motorcycles dogs cats \
    --token_aggregation_mode last \
    --max_new_tokens 1 \
    --num_samples 1000 \
    --output_dir ./results/llama2-7b/steering_vectors

# Generate outputs with MidSteer applied to the horses concept
.venv/bin/python3 scripts/llm/run_with_steering.py \
    --model_name meta-llama/Llama-2-7b-chat-hf \
    --layer_type self_attn \
    --source_concept horses \
    --source_concept_path ./results/llama2-7b/steering_vectors/horses.pt \
    --target_concept_path ./results/llama2-7b/steering_vectors/motorcycles.pt \
    --steer_type midsteer \
    --strength 1.0 \
    --mu_neutral ./results/llama2-7b/covariances/means.pt \
    --cov_neutral ./results/llama2-7b/covariances/covariances.pt \
    --dataset_type template \
    --samples_per_question 10 \
    --max_new_tokens 100 \
    --output_dir ./results/llama2-7b/evaluation/horses_erasure

# Score the results
.venv/bin/python3 scripts/llm/concept_scoring.py \
    --concept horses motorcycles \
    --dir ./results/llama2-7b/evaluation/horses_erasure

.venv/bin/python3 scripts/llm/consistency_scoring.py \
    --dir ./results/llama2-7b/evaluation/horses_erasure
```

**Full experiment with multiple methods and strengths:**
```bash
# For SLURM clusters
sbatch --job-name=llm-erasure-llama2 \
    exp/sh/slurm_llm_base_experiment.sh \
    meta-llama/Llama-2-7b-chat-hf \
    self_attn \
    50000 \
    all \
    100 \
    "0.5 1.0 1.5 2.0 2.5 3.0"

# For Grid Engine clusters (qsub)
qsub -N llm-erasure-llama2 \
    exp/sh/slurm_llm_base_experiment.sh \
    meta-llama/Llama-2-7b-chat-hf \
    self_attn \
    50000 \
    all \
    100 \
    "0.5 1.0 1.5 2.0 2.5 3.0"
```

This script will:
1. Estimate covariances from 50k neutral prompts (Alpaca dataset)
2. Generate steering vectors for each concept (horses, motorcycles, dogs, cats)
3. Run experiments with CASteer, LEACE, and MidSteer at various strengths
4. Evaluate on template prompts, MMLU, and Alpaca datasets
5. Compute concept scores and consistency metrics

**Output:** Results are saved to `exp/results/{model_name}/{job_name}/evaluation/`
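
Step 1 of the script aggregates statistics over roughly 50k prompts, so the collected activations may not fit in memory at once. One common approach is to update the mean and covariance one vector at a time (Welford's algorithm). The sketch below assumes this streaming formulation and uses plain Python lists to stay dependency-free. `RunningMoments` is a hypothetical name, and `scripts/llm/estimate_covariances.py` may compute these statistics differently.

```python
class RunningMoments:
    """Streaming mean and sample covariance (Welford-style updates)."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim
        # Accumulated co-moments: sum of outer products of deviations.
        self.m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        """Fold one activation vector into the running statistics."""
        self.n += 1
        # Deviation from the mean before this sample is included.
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta_old)]
        # Deviation from the mean after this sample is included.
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i, di in enumerate(delta_old):
            row = self.m2[i]
            for j, dj in enumerate(delta_new):
                row[j] += di * dj

    def covariance(self):
        """Return the unbiased sample covariance matrix."""
        if self.n < 2:
            raise ValueError("need at least two samples")
        return [[v / (self.n - 1) for v in row] for row in self.m2]


moments = RunningMoments(dim=2)
for vector in ([1.0, 2.0], [3.0, 4.0], [5.0, 9.0]):
    moments.update(vector)
```

In practice the same update would be applied to tensors rather than lists; the pure-Python version only shows the arithmetic.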

#### B. Concept Flipping (Section 4.2.2)

This experiment follows the same procedure as erasure but replaces one concept with another (e.g., horses → motorcycles).

The `slurm_llm_base_experiment.sh` script runs both erasure and flipping experiments. Results for flipping appear in directories like `horses_to_motorcycles__horses/`.
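
On a single GPU, a comparable sweep can be approximated by looping the quick-start `run_with_steering.py` command over several strengths. The sketch below only builds the command lines. It reuses the flags and paths from the quick-start example, while `build_sweep` and the per-strength output directory layout are hypothetical choices. Only `midsteer` appears above as a `--steer_type` value, so the names of the other methods should be checked against the script itself.

```python
import shlex

# Strengths passed to the cluster script in the full experiment.
STRENGTHS = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]


def build_sweep(source, target, strengths=STRENGTHS, root="./results/llama2-7b"):
    """Return one run_with_steering.py argument list per steering strength."""
    commands = []
    for strength in strengths:
        commands.append([
            ".venv/bin/python3", "scripts/llm/run_with_steering.py",
            "--model_name", "meta-llama/Llama-2-7b-chat-hf",
            "--layer_type", "self_attn",
            "--source_concept", source,
            "--source_concept_path", f"{root}/steering_vectors/{source}.pt",
            "--target_concept_path", f"{root}/steering_vectors/{target}.pt",
            "--steer_type", "midsteer",
            "--strength", str(strength),
            "--mu_neutral", f"{root}/covariances/means.pt",
            "--cov_neutral", f"{root}/covariances/covariances.pt",
            "--dataset_type", "template",
            "--samples_per_question", "10",
            "--max_new_tokens", "100",
            # Hypothetical layout: one output directory per strength.
            "--output_dir", f"{root}/evaluation/{source}_to_{target}/midsteer-{strength}",
        ])
    return commands


# Print shell-quoted commands for inspection before running anything.
for argv in build_sweep("horses", "motorcycles"):
    print(shlex.join(argv))
```

To execute the sweep rather than print it, each argument list can be passed to `subprocess.run(argv, check=True)`.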

### 2. Diffusion Model Experiments

#### Models Tested
- Stable Diffusion XL (SDXL)
- SANA 1.6B

#### Concept Pairs
- horse ↔ motorcycle
- snoopy ↔ mickey
- chihuahua ↔ muffin

#### A. Concept Erasure & Flipping

**Quick start (single GPU):**
```bash
# Estimate neutral activation covariances
.venv/bin/python3 scripts/diffusion/estimate_covariances.py \
    --model_name sdxl-turbo \
    --control_mode attn_output \
    --aggregation_mode all \
    --num_samples 50000 \
    --output_dir ./results/sdxl/covariances

# Estimate one steering vector per concept
.venv/bin/python3 scripts/diffusion/estimate_steering_vectors.py \
    --model_name sdxl-turbo \
    --control_mode attn_output \
    --topics horse motorcycle snoopy mickey chihuahua muffin \
    --aggregation_mode average \
    --num_samples 1000 \
    --output_dir ./results/sdxl/steering_vectors

# Generate images with MidSteer, translating horse to motorcycle
.venv/bin/python3 scripts/diffusion/run_with_steering.py \
    --model_name sdxl \
    --control_mode attn_output \
    --generate_concept horse \
    --output_dir ./results/sdxl/evaluation/horse_to_motorcycle/midsteer-1.0 \
    --steering_method midsteer \
    --steering_strength 1.0 \
    --covariances_dir ./results/sdxl/covariances \
    --num_images_per_prompt 10 \
    --seed 42 \
    translate \
    --source_concept_path ./results/sdxl/steering_vectors/horse.pt \
    --target_concept_path ./results/sdxl/steering_vectors/motorcycle.pt

# Compute CLIP scores and FID
.venv/bin/python3 scripts/diffusion/produce_scores.py \
    --concept horse motorcycle \
    --dir ./results/sdxl/evaluation/horse_to_motorcycle \
    --num_workers 4 \
    --batch_size 32
```

**Full experiment with all methods:**
```bash
# For SLURM clusters
sbatch --job-name=diffusion-sdxl \
    exp/sh/slurm_diffusion_base_experiment.sh \
    sdxl \
    attn_output \
    50000 \
    all \
    "0.5 1.0 1.5 2.0 2.5 3.0"

# For Grid Engine
qsub -N diffusion-sdxl \
    exp/sh/slurm_diffusion_base_experiment.sh \
    sdxl \
    attn_output \
    50000 \
    all \
    "0.5 1.0 1.5 2.0 2.5 3.0"
```

This runs comprehensive experiments including:
- Concept translation (flipping) for all concept pairs
- Concept erasure for all concepts
- Multiple steering strengths with CASteer, LEACE, and MidSteer
- CLIP score and FID computation

**Output:** Results are saved to `exp/results/{model_name}/{job_name}/evaluation/`

Once the experiments have finished, `notebooks/produce_charts.ipynb` generates the paper figures. This notebook:
- Loads results from experiment directories
- Computes Pareto frontiers for each method
- Generates plots comparing CASteer, LEACE, and MidSteer
- Produces tables with numerical results
- Exports figures to the `artefacts/` directory
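
The Pareto frontier step can be sketched as follows. The example assumes each run is summarized by two scores where higher is better on both axes (for instance ΔCS and a consistency score). `pareto_frontier` and the sample numbers are illustrative and are not taken from the notebook or the paper.

```python
def pareto_frontier(points):
    """Return the points that no other point dominates.

    Each point is an (x, y) pair with higher values better on both axes.
    A point is dominated if another point is at least as good on both axes
    and strictly better on at least one.
    """
    frontier = []
    best_y = float("-inf")
    # Visit points from best x to worst x; ties on x are visited best y first.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        # Everything seen so far has x at least as large, so a point
        # survives only if it strictly improves on the best y seen.
        if y > best_y:
            frontier.append((x, y))
            best_y = y
    return sorted(frontier)


# Made-up (score_a, score_b) pairs, one per run.
runs = [(1.0, 0.9), (2.0, 0.8), (1.5, 0.7), (3.0, 0.5), (2.5, 0.5)]
frontier = pareto_frontier(runs)
```

For a metric where lower is better, such as FID, the value can be negated before calling the function so that both axes share the same direction.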

### 3. Expected Results

**LLM Concept Flipping (horses → motorcycles):**
- MidSteer: Switches horses to motorcycles while leaving prompts that already concern motorcycles intact
- LEACE/CASteer: May affect both forward and reverse directions

**Diffusion Concept Flipping:**
- MidSteer: Better preservation of unrelated concepts (lower FID on unrelated concepts)
- Higher CLIP score difference between source and target concepts

**Key Metrics:**
- **CS (Concept Score)**: Relevance to target concept (0-10 for LLM, CLIP score for diffusion)
- **ΔCS**: Difference between target and source concept scores
- **BERTScore**: Consistency of generated text (LLM only)
- **FID (Fréchet Inception Distance)**: Image quality preservation; lower is better (diffusion only)
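
As a worked example of ΔCS, the sketch below averages per-sample concept scores and subtracts the source mean from the target mean. It assumes that scores are aggregated by a simple mean and that ΔCS is taken as target minus source. The score lists are made-up values on the 0-10 LLM scale, and `delta_cs` is a hypothetical helper rather than a function from the scoring scripts.

```python
from statistics import mean


def delta_cs(target_scores, source_scores):
    """Mean target-concept score minus mean source-concept score."""
    return mean(target_scores) - mean(source_scores)


# Made-up scores for the same steered outputs, judged once per concept.
motorcycle_scores = [8, 9, 7, 8]  # target concept
horse_scores = [1, 0, 2, 1]       # source concept
result = delta_cs(motorcycle_scores, horse_scores)
```

A larger positive value means the outputs lean more strongly toward the target concept than toward the source concept.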