# Endogenous Steering Resistance in Language Models

## Overview

This repository contains code and experiments investigating **Endogenous Steering Resistance (ESR)** - a phenomenon where large language models spontaneously detect and correct inappropriate activation steering during inference.

---

## Experimental Methods

All experiments use **Sparse Autoencoder (SAE) latents** to enable precise activation steering. We use:
- **Goodfire SAEs** for Llama-3 models (8B and 70B)
- **GemmaScope SAEs** for Gemma-2 models (2B, 9B, 27B)

### Basic Protocol

1. Prompt an LLM with a simple how-to question (e.g., "Explain how to calculate probability")
2. Generate a response while steering with an irrelevant SAE latent (e.g., "Lists of human body positions and poses")
3. Use a judge model (Claude 4.5 Sonnet) to:
   - Identify separate attempts within the response (segmented by explicit self-correction phrases)
   - Score each attempt from 0-100 based on topical relevance

**Key metric**: **Mean Score Improvement** - the difference between the final attempt score and the first attempt score, averaged across all trials.
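
As a minimal sketch of how this metric could be computed, assume each trial is stored as a list of per-attempt judge scores. That representation and the function names are hypothetical; the repository's result format may differ. Single-attempt trials contribute zero, since their first and final attempts coincide.

```python
from statistics import mean

def mean_score_improvement(trials):
    """Average of (final attempt score - first attempt score) over all trials."""
    return mean(scores[-1] - scores[0] for scores in trials)

def multi_attempt_rate(trials):
    """Fraction of trials in which the judge identified more than one attempt."""
    return sum(len(scores) > 1 for scores in trials) / len(trials)

# Hypothetical judge scores: one inner list per trial, one score per attempt.
example_trials = [[40], [20, 85], [55], [30, 10]]
msi = mean_score_improvement(example_trials)   # (0 + 65 + 0 - 20) / 4
rate = multi_attempt_rate(example_trials)      # 2 of 4 trials have >1 attempt
```

Because most trials contain a single attempt, the metric is dominated by zeros, so it can stay small even when individual corrections are large.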

### Object-Level Prompts

We use 38 curated "explain how" prompts on topics ranging from math to business skills to housekeeping. Without steering, all models produce high-quality responses (≥90/100) and show no spontaneous self-correction.

### Steering Calibration

Model behavior varies strongly with steering strength:
- **Low boosts**: Little effect, coherent responses
- **High boosts**: Breakdown into nonsensical outputs
- **Intermediate boosts**: ESR occurs here

We define a **threshold boost value** for each latent as the boost yielding an average judge score of 50/100 for first attempts. We use the Probabilistic Bisection Algorithm to find these thresholds efficiently.
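
The idea behind probabilistic bisection can be illustrated with a generic grid-based sketch. This is not the code in `threshold_finder.py`: the function names, the assumed per-observation accuracy `p=0.7`, and the simulated score curve are all assumptions made for illustration. The sketch keeps a posterior over the threshold location, queries at the posterior median, and re-weights each side of the query according to whether a noisy first-attempt score lands above or below 50.

```python
import math
import random

def probabilistic_bisection(observe_above, lo, hi, n_queries=200, p=0.7, grid_size=1000):
    """Return the posterior-median estimate of a threshold in [lo, hi].

    observe_above(x) is a noisy oracle: True suggests the threshold is above x.
    p is the assumed probability that a single observation is correct.
    """
    step = (hi - lo) / grid_size
    xs = [lo + (i + 0.5) * step for i in range(grid_size)]  # grid cell midpoints
    density = [1.0 / grid_size] * grid_size                  # uniform prior

    def posterior_median():
        cumulative = 0.0
        for x, weight in zip(xs, density):
            cumulative += weight
            if cumulative >= 0.5:
                return x
        return xs[-1]

    for _ in range(n_queries):
        query = posterior_median()
        above = observe_above(query)
        # Up-weight the side the observation points to, down-weight the other.
        density = [
            weight * (p if (x > query) == above else 1.0 - p)
            for x, weight in zip(xs, density)
        ]
        total = sum(density)
        density = [weight / total for weight in density]
    return posterior_median()

def simulated_first_attempt_score(boost, true_threshold=4.0, noise_sd=10.0):
    """Toy stand-in for 'steer, generate, judge': scores fall as boost rises."""
    clean = 100.0 / (1.0 + math.exp(boost - true_threshold))
    return clean + random.gauss(0.0, noise_sd)

random.seed(0)
# A score above 50 means the steering was too weak, so the threshold lies higher.
estimate = probabilistic_bisection(
    lambda boost: simulated_first_attempt_score(boost) > 50.0, lo=0.0, hi=12.0
)
```

Each query costs one steered generation plus one judge call, which is why concentrating queries near the current median is attractive compared with a dense sweep over boost values.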

---

## Experiments

### Experiment 1: ESR Across Model Sizes

**Script**: [`experiment_1_esr.py`](experiment_1_esr.py)

**Purpose**: Systematically characterize ESR incidence across models of different sizes

**Method**:
- Test 5 models: Llama-3.3-70B, Llama-3.1-8B, Gemma-2-27B, Gemma-2-9B, Gemma-2-2B
- For each model, sample ~50-150 irrelevant, concrete SAE latents
- Generate 10 trials per latent at threshold boost strength
- Measure multi-attempt rate and mean score improvement

**Key Results**:
- Llama-3.3-70B shows the highest ESR: 1.88% multi-attempt rate, 0.48 mean score improvement
- ESR scales with model size within the Llama-3 family
- Gemma-2 models show minimal ESR even at 27B parameters
- Control experiment: 0% multi-attempt responses without steering (n=12,230 trials)

**Files**:
- Data: `experiment_results/experiment_results_<model>_<timestamp>.json`
- Plots: [`plots/esr_combined_figure_*.png`](plots/)
- Plotting script: [`plotting/plot_exp1.py`](plotting/plot_exp1.py)

---

### Experiment 2: Boost Level Ablation

**Script**: [`experiment_2_multi_boost.py`](experiment_2_multi_boost.py)

**Purpose**: Validate threshold-finding methodology and characterize how ESR varies with steering strength

**Method**:
- For Llama-3.3-70B, sweep 8 boost levels from (threshold - 1.5σ) to (threshold + 3σ)
- Generate 500 responses per boost level (50 latents × 10 prompts)
- Measure ESR characteristics at each boost level
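
A sweep of this shape could be generated as follows. This is only a sketch: it assumes the 8 levels are evenly spaced, and it takes σ as an input because its definition is not given in this README. `boost_sweep` and its parameters are hypothetical names.

```python
def boost_sweep(threshold, sigma, n_levels=8, lo_offset=-1.5, hi_offset=3.0):
    """Evenly spaced boosts from threshold + lo_offset*sigma to threshold + hi_offset*sigma."""
    step = (hi_offset - lo_offset) / (n_levels - 1)
    return [threshold + (lo_offset + i * step) * sigma for i in range(n_levels)]

# Illustrative values only; real thresholds are calibrated per latent.
levels = boost_sweep(threshold=10.0, sigma=2.0)
```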

**Key Results**:
- ESR exhibits a non-monotonic relationship with boost level
- Peak ESR occurs slightly below threshold (about 0.3σ below): strong enough to trigger self-monitoring, but not so strong as to prevent coherent correction
- Mean score improvement peaks at 31.37 points (95% CI ±17.82) at the optimal boost
- At very high boosts (>threshold + 1σ), steering overwhelms coherence, producing repetitive or nonsensical outputs

**Files**:
- Data: `experiment_results/experiment_results_multiboost_*.json`
- Plots: [`plots/multi_boost_*.png`](plots/)
- Plotting script: [`plotting/plot_exp2.py`](plotting/plot_exp2.py)

---

### Experiment 3: Off-Topic Detector Identification and Ablation

**Directory**: [`experiment_3_off_topic_detectors/`](experiment_3_off_topic_detectors/)

**Purpose**: Identify and causally test SAE latents involved in detecting off-topic responses

**Method**:

**Phase 1: Discovery** ([`find_off_topic_detectors.py`](experiment_3_off_topic_detectors/find_off_topic_detectors.py))
1. Generate 500 unsteered responses from Llama-3.3-70B
2. Create mismatched prompt-response pairs by randomly shuffling responses
3. Record which SAE latents activate in matched vs. mismatched conditions
4. Identify 27 latents that activate in >50% of mismatched pairs but 0% of matched pairs
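
The selection criterion in steps 3 and 4 can be sketched as below. The sketch assumes each prompt-response pair has already been reduced to the set of latent ids that were active on it; how "active" is determined and the function name are assumptions, not taken from `find_off_topic_detectors.py`.

```python
from collections import Counter

def select_off_topic_detectors(matched_active, mismatched_active, min_rate=0.5):
    """Latents active in > min_rate of mismatched pairs and in no matched pair.

    Both arguments are lists with one set of active latent ids per pair.
    """
    n_mismatched = len(mismatched_active)
    mismatched_counts = Counter(
        latent for active in mismatched_active for latent in active
    )
    seen_in_matched = set().union(*matched_active)
    return sorted(
        latent
        for latent, count in mismatched_counts.items()
        if count / n_mismatched > min_rate and latent not in seen_in_matched
    )

# Toy data: latent 3 fires on 3 of 4 mismatched pairs and on no matched pair.
matched = [{1, 2}, {2}]
mismatched = [{1, 3}, {3, 4}, {4, 5}, {3}]
detectors = select_off_topic_detectors(matched, mismatched)
```

Requiring zero activity on matched pairs is a strict filter: it trades recall for a detector set that is unlikely to respond to ordinary on-topic text.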

**Phase 2: Ablation** ([`experiment_3_with_ablation.py`](experiment_3_off_topic_detectors/experiment_3_with_ablation.py))
1. Run ESR experiment while clamping the 27 off-topic detector latents to zero
2. Compare ESR metrics with and without ablation
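
One common way to clamp SAE latents to zero is to subtract each latent's decoder contribution from the activation, which leaves the SAE reconstruction error untouched. The sketch below shows that arithmetic on plain lists. It is an assumption about how such an ablation can be done, not a description of `experiment_3_with_ablation.py`, and all names are hypothetical.

```python
def ablate_latents(activation, latent_acts, decoder_rows, ablate_ids):
    """Clamp the chosen latents to zero by subtracting their decoder contribution.

    Subtracting latent_acts[i] * decoder_rows[i] is equivalent to decoding with
    latent i set to zero while keeping the SAE reconstruction error unchanged.
    """
    out = list(activation)
    for i in ablate_ids:
        for k, d in enumerate(decoder_rows[i]):
            out[k] -= latent_acts[i] * d
    return out

# Toy example: two latents with axis-aligned decoder directions.
activation = [1.0, 2.0, 0.5]
decoder_rows = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
latent_acts = [0.5, 2.0]
ablated = ablate_latents(activation, latent_acts, decoder_rows, ablate_ids=[1])
```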

**Key Results**:
- Ablating off-topic detector latents reduces ESR by **78%** (mean score improvement: 0.48 → 0.10)
- Multi-attempt rate drops by 40% (1.88% → 1.12%)
- First-attempt scores barely change (<1.2 points on the 0-100 scale), indicating that these latents specifically support meta-cognitive monitoring rather than general response quality
- Ablating "backtracking" latents produces no significant change

**Sequential Activation Analysis**:
- Off-topic detector latents activate **several tokens before** verbal self-correction appears
- Backtracking latents then activate, coinciding with the self-interruption tokens
- Finally, generation returns to on-topic content

This temporal sequence provides evidence for a causal chain: off-topic detection → backtracking trigger → corrective generation.

**Files**:
- Discovery prompts: [`experiment_3_off_topic_detectors/prompts_otd_discovery.txt`](experiment_3_off_topic_detectors/prompts_otd_discovery.txt) (20 prompts for finding detectors)
- Evaluation prompts: [`prompts.txt`](prompts.txt) (38 prompts for testing ablation)
- Off-topic detectors: `data/off_topic_detectors_Meta-Llama-3.3-70B-Instruct.json`
- Plots: [`plots/experiment_3_ablation_metrics_bar_chart_*.png`](plots/)
- Plotting script: [`plotting/plot_exp3.py`](plotting/plot_exp3.py)

---

### Experiment 4: Fine-Tuning for ESR

**Directory**: [`experiment_4_finetuning/`](experiment_4_finetuning/)

**Purpose**: Test whether ESR can be induced through training on synthetic self-correction examples

**Method**:
1. Generate synthetic self-correction examples with structure:
   ```
   [Off-topic content] → [Self-correction phrase] → [Correct answer]
   ```
2. Apply **loss masking** to train only on correction portion (not the off-topic content)
3. Fine-tune Llama-3.1-8B with LoRA on datasets mixing masked self-correction examples with normal responses at ratios from 10% to 90%
4. Recalibrate steering thresholds for each checkpoint to normalize first-attempt difficulty
5. Measure ESR characteristics on steered responses
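
Loss masking as described in step 2 is commonly implemented by setting the labels of untrained tokens to `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`. The sketch below assumes that convention and also masks the prompt tokens, which is an assumption; the actual behavior is defined by the training config in this directory. The function name and token ids are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_masked_labels(prompt_ids, off_topic_ids, correction_ids):
    """Return (input_ids, labels) where only the correction tokens carry loss."""
    input_ids = list(prompt_ids) + list(off_topic_ids) + list(correction_ids)
    n_masked = len(prompt_ids) + len(off_topic_ids)
    labels = [IGNORE_INDEX] * n_masked + list(correction_ids)
    return input_ids, labels

# Toy token ids: prompt, off-topic content, then correction phrase + answer.
input_ids, labels = build_masked_labels([1, 2], [3, 4, 5], [6, 7])
```

The model still sees the off-topic tokens as context, but receives no gradient that would teach it to produce them.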

**Key Results**:
- Fine-tuning successfully induces frequent self-correction behavior (the mean number of attempts per response increases from 1.0 to 2.15 when 90% of the training data consists of self-correction examples)
- However, **effectiveness does not scale correspondingly**:
  - Success rates hover around 45-60% without a clear trend
  - Mean score improvement peaks at around a 50% self-correction ratio and does not continue rising beyond it
- This indicates a **dissociation** between self-correction as a learned behavior and genuine steering resistance
- Models learn *when* to attempt correction (template matching) but not *how* to correct effectively for novel steering targets

**Interpretation**: ESR in larger models emerges from different mechanisms (perhaps genuine uncertainty detection) that cannot be replicated through behavioral cloning of correction sequences.

**Files**:
- Training data generation: [`experiment_4_finetuning/setup_masked_ratio_sweep.py`](experiment_4_finetuning/setup_masked_ratio_sweep.py)
- Configs: `experiment_4_finetuning/config_masked_ratio_*pct.yml`
- Training config: [`experiment_4_finetuning/instruct-lora-8b-with-masking.yaml`](experiment_4_finetuning/instruct-lora-8b-with-masking.yaml)
- Plotting script: [`plotting/plot_exp4.py`](plotting/plot_exp4.py)

---

### Experiment 5: Meta-Prompting for Enhanced ESR

**Script**: [`experiment_5_prompt_variants.py`](experiment_5_prompt_variants.py)

**Purpose**: Test whether ESR can be deliberately enhanced through prompting

**Method**:
- Append meta-prompts to standard object-level prompts
- Test variants including:
  - "Make sure to stay on topic"
  - "Don't get distracted!"
  - "You are being distracted, make sure not to let it affect your output"
  - "You are being distracted with [latent label]. Ignore it."
  - "Your generation is being steered by an SAE latent with label [latent label]. Ignore it."
- Run full ESR protocol across all 5 models with each variant
- Use same latents/thresholds as baseline experiments for direct comparison
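
The variants above can be expressed as simple templates. In this sketch the dictionary keys, the `build_prompt` helper, and the blank-line separator between the object-level prompt and the meta-prompt are assumptions; only the meta-prompt wording comes from the list above.

```python
META_PROMPTS = {
    "stay_on_topic": "Make sure to stay on topic",
    "dont_get_distracted": "Don't get distracted!",
    "being_distracted": "You are being distracted, make sure not to let it affect your output",
    "distracted_with_label": "You are being distracted with {latent_label}. Ignore it.",
    "steered_with_label": (
        "Your generation is being steered by an SAE latent with label "
        "{latent_label}. Ignore it."
    ),
}

def build_prompt(object_prompt, variant, latent_label=None):
    """Append the chosen meta-prompt, filling in the latent label when needed."""
    template = META_PROMPTS[variant]
    if "{latent_label}" in template:
        if latent_label is None:
            raise ValueError(f"variant {variant!r} needs a latent label")
        template = template.format(latent_label=latent_label)
    return f"{object_prompt}\n\n{template}"

prompt = build_prompt(
    "Explain how to calculate probability",
    "distracted_with_label",
    latent_label="Lists of human body positions and poses",
)
```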

**Key Results**:
- "Don't get distracted" meta-prompt significantly enhances all ESR metrics
- **First-attempt scores improve**: Greater initial resistance to steering
- **Multi-attempt rates increase**: Llama-3.3-70B goes from 2.8% to 6.7%
- **Mean score improvement increases**: Llama-3.3-70B shows 3× gain (0.70 → 2.16 points)
- Effects scale with model size: larger models benefit more from meta-prompting

**Interpretation**:
- ESR is not merely emergent but **controllable**
- Underlying self-monitoring circuits must already be present for prompting to enhance them
- Meta-prompting could serve as a lightweight intervention to increase robustness against unwanted steering
- The same techniques might be used to suppress ESR when steering interventions are desirable

**Files**:
- Data: `experiment_results/experiment_results_<model>_prompt_variant_*.json`
- Plots: [`plots/experiment_5_combined_baseline_vs_resistance_*.png`](plots/), [`plots/experiment_5_prompt_variant_bars_*.png`](plots/)
- Plotting script: [`plotting/plot_exp5.py`](plotting/plot_exp5.py)

---

## Core Infrastructure

### Steering Engine
- **[`vllm_engine.py`](vllm_engine.py)**: vLLM-based engine with SAE steering support
- Uses vLLM's `InterventionInputs` to apply activation steering at specified layers
- Supports both Goodfire and GemmaScope SAE architectures
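
Conceptually, steering with an SAE latent adds a scaled copy of that latent's decoder direction to the model's activations at the chosen layer. The sketch below shows only that arithmetic on plain lists, under the assumption that the boost is applied additively; it does not use vLLM or reproduce the engine's actual interface, and `steer_activation` is a hypothetical name.

```python
def steer_activation(activation, decoder_direction, boost):
    """Add boost * decoder_direction to one activation vector."""
    return [a + boost * d for a, d in zip(activation, decoder_direction)]

# Toy 3-dimensional activation steered along the second coordinate.
steered = steer_activation([0.2, -1.0, 0.5], [0.0, 1.0, 0.0], boost=4.0)
```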

### Evaluation
- **[`claude_judge.py`](claude_judge.py)**: Claude 4.5 Sonnet judge for scoring responses
- Segments responses into attempts based on explicit self-correction phrases
- Scores each attempt 0-100 on topical relevance
- Cross-validated with 4 other judge models (GPT-5-Mini, Qwen3-32B, Haiku-4.5, Gemini-2.5-Flash) - all show consistent ESR rankings

### Feature Selection
- **[`sample_features.py`](sample_features.py)**: Samples filtered SAE latents for steering
- **[`relevance_filtering.py`](relevance_filtering.py)**: Excludes latents naturally activated by prompts
- **[`concreteness_filtering.py`](concreteness_filtering.py)**: Filters for concrete, domain-specific latents (easier to detect as off-topic)

### Threshold Finding
- **[`threshold_finder.py`](threshold_finder.py)**: Probabilistic Bisection Algorithm for finding boost thresholds
- Efficiently identifies the boost value yielding an average first-attempt score of 50/100

### Utilities
- **[`experiment_config.py`](experiment_config.py)**: Configuration dataclasses
- **[`experiment_dataclasses.py`](experiment_dataclasses.py)**: Data structures for results
- **[`result_file_utils.py`](result_file_utils.py)**: Loading/saving experiment results
- **[`utils.py`](utils.py)**: Helper functions

---

## How to Run

### Prerequisites

```bash
# Install dependencies
bash install.sh

# Set up environment variables in .env (these lines belong in the file, not the shell)
HF_TOKEN=hf_...
ANTHROPIC_API_KEY=sk-ant-...
OPENROUTER_API_KEY=sk-or-...  # Optional, for alternative judges
```

### Running Experiments

> **Note**: For Gemma models, use `./python_for_gemma.sh` instead of `python`. This wrapper sets required environment variables (`VLLM_ATTENTION_BACKEND=FLASHINFER VLLM_FLASH_ATTN_VERSION=3`) for Gemma's attention mechanism.

**Experiment 1 (ESR across models)**:
```bash
python experiment_1_esr.py 70b  # or 8b
./python_for_gemma.sh experiment_1_esr.py gemma-2b  # or gemma-9b, gemma-27b
python plotting/plot_exp1.py
```

**Experiment 2 (Boost ablation)**:
```bash
python experiment_2_multi_boost.py 70b
python plotting/plot_exp2.py
```

**Experiment 3 (Off-topic detector ablation)**:
```bash
cd experiment_3_off_topic_detectors
python find_off_topic_detectors.py 70b
python experiment_3_with_ablation.py 70b --ablate ../data/off_topic_detectors_Meta-Llama-3.3-70B-Instruct.json
python ../plotting/plot_exp3.py Meta-Llama-3.3-70B-Instruct
```

**Experiment 4 (Fine-tuning)**:
```bash
cd experiment_4_finetuning
# Generate datasets
uv run python setup_masked_ratio_sweep.py
# Train models
bash train.sh
# Evaluate
bash run_esr.sh
# Plot results
python ../plotting/plot_exp4.py
```

**Experiment 5 (Meta-prompting)**:
```bash
python experiment_5_prompt_variants.py 70b --from-results <baseline_results.json>
python plotting/plot_exp5.py
```

### Generate All Plots

To generate all plots into a timestamped folder:

```bash
python plotting/plot_all.py
```
