# SteerCLR: Unsupervised Steering Vectors for Language Models

This repository contains a clean, reproducible implementation of SteerCLR, an unsupervised method for learning steering vectors for language models. It learns a diverse set of vectors that modify model behavior without requiring labeled data.

## Overview

SteerCLR combines three ideas:
1. **Contrastive Learning**: A contrastive loss pushes the learned steering directions apart so that they stay diverse
2. **Activation Steering**: Learned vectors are injected into model activations during inference
3. **Unsupervised Training**: Vectors are learned from unlabeled text, with no behavior annotations required
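
The activation-steering step can be illustrated with a PyTorch forward hook. This is a minimal sketch, not the repository's implementation: `add_steering_hook` is a hypothetical helper, and a toy `nn.Linear` stands in for the hooked submodule of a real model.

```python
import torch
from torch import nn


def add_steering_hook(module: nn.Module, vector: torch.Tensor, coefficient: float):
    """Register a forward hook that adds `coefficient * vector` to the module output."""

    def hook(_module, _inputs, output):
        # A value returned from a forward hook replaces the module's output.
        return output + coefficient * vector.to(output.dtype)

    return module.register_forward_hook(hook)


# Toy stand-in for one transformer submodule (an assumption for illustration).
torch.manual_seed(0)
layer = nn.Linear(8, 8)
x = torch.randn(2, 5, 8)          # (batch, sequence, hidden)
steering_vector = torch.randn(8)  # one direction, broadcast over batch and sequence

baseline = layer(x)
handle = add_steering_hook(layer, steering_vector, coefficient=1.5)
steered = layer(x)                # every position is shifted by 1.5 * steering_vector
handle.remove()                   # remove the hook to restore normal behavior
restored = layer(x)
```

Keeping the handle returned by `register_forward_hook` makes it easy to switch steering off again, which is useful when comparing steered and unsteered generations.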

## Installation

### Option 1: Using pip (recommended)
```bash
pip install -r requirements.txt
```

### Option 2: Using uv (if available)
```bash
uv sync
```

### Option 3: Development installation
```bash
pip install -e .
```

## Quick Start

### 1. Training Steering Vectors

Train steering vectors using the demo configuration:
```bash
./scripts/train.sh
```

Or train with a specific configuration:
```bash
./scripts/train.sh configs/qwen_config.yaml
```

### 2. Using Trained Steering Vectors

After training, run the steering demo on an experiment directory:
```bash
./scripts/demo_steering.sh outputs/demo/[experiment_directory]
```

Or use steering vectors directly:
```bash
python scripts/run_steering.py \
    --vectors outputs/demo/[experiment_directory]/steering_vectors.pt \
    --vector-idx 0 \
    --coefficient 1.5 \
    --prompt "Your prompt here"
```

## Project Structure

```
.
├── configs/                    # Configuration files
│   ├── demo_config.yaml       # Lightweight demo configuration
│   └── qwen_config.yaml       # Configuration for Qwen2.5-7B-Instruct
├── data/                      # Training and validation data
│   ├── train/                 # Training datasets (CAA format)
│   └── validation/            # Validation datasets
├── scripts/                   # Executable scripts
│   ├── train.sh              # Training script (bash)
│   ├── train.py              # Training script (python)
│   ├── demo_steering.sh      # Steering demonstration script
│   └── run_steering.py       # Interactive steering script
├── src/                      # Source code
│   ├── steering/             # Steering vector implementation
│   └── unsupervised_steering/ # SteerCLR training implementation
├── outputs/                  # Training outputs (created during training)
├── pyproject.toml           # Project configuration
├── requirements.txt         # Dependencies
└── README.md               # This file
```

## Configuration

Configuration files are written in YAML and validated with Pydantic. Key parameters:

### Model Configuration
- `model_name`: HuggingFace model identifier
- `target_layer`: Layer index for activation capture
- `source_layer`: Layer index for activation steering
- `source_submodule`: Specific submodule to hook (e.g., "mlp.down_proj")

### Training Configuration
- `n_training_steps`: Number of training steps
- `n_vectors`: Number of steering vectors to learn
- `learning_rate`: Optimizer learning rate
- `batch_size`: Training batch size

### Loss Function Configuration
- `alpha`: Weight for magnitude loss
- `beta`: Weight for diversity loss
- `lambda_`: Weight for orthogonality loss
- `diversity_loss_type`: Type of contrastive loss ("ntxent", "supcon", etc.)
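
To show how Pydantic validation of these parameters might look, here is a hedged sketch. The class name `SteerCLRConfig`, the flat layout, the constraints, and all example values are assumptions; the repository's actual schema may be nested or named differently. Only the field names come from the lists above.

```python
from pydantic import BaseModel, Field, ValidationError


class SteerCLRConfig(BaseModel):
    """Hypothetical flat schema mirroring the parameters documented above."""

    # Model configuration
    model_name: str
    target_layer: int
    source_layer: int
    source_submodule: str
    # Training configuration
    n_training_steps: int = Field(gt=0)
    n_vectors: int = Field(gt=0)
    learning_rate: float = Field(gt=0)
    batch_size: int = Field(gt=0)
    # Loss function configuration
    alpha: float
    beta: float
    lambda_: float
    diversity_loss_type: str


# Illustrative values only; in practice this dict would come from a parsed YAML file.
raw = {
    "model_name": "Qwen/Qwen2.5-7B-Instruct",
    "target_layer": 16,
    "source_layer": 8,
    "source_submodule": "mlp.down_proj",
    "n_training_steps": 1000,
    "n_vectors": 16,
    "learning_rate": 1e-3,
    "batch_size": 8,
    "alpha": 1.0,
    "beta": 1.0,
    "lambda_": 0.1,
    "diversity_loss_type": "ntxent",
}
config = SteerCLRConfig(**raw)

try:
    SteerCLRConfig(**{**raw, "n_vectors": 0})  # violates the gt=0 constraint
except ValidationError as err:
    print(f"rejected invalid config: {len(err.errors())} error(s)")
```

Some Pydantic 2 releases emit a warning because `model_name` begins with the library's reserved `model_` prefix; the field still validates normally.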

## Data Format

Training data is a JSON array of objects, each with a `question` field:
```json
[
    {
        "question": "Question text with choices..."
    }
]
```

Validation data (for open-ended generation):
```json
[
    {
        "question": "Open-ended question text"
    }
]
```
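
A dataset file in either format can be checked before training with a few lines of standard-library Python. This is a sketch only: `load_questions` is a hypothetical helper, not part of the repository, and it assumes every record must carry a non-empty `question` string.

```python
import json
from pathlib import Path


def load_questions(path):
    """Return the question strings from a dataset file, validating its shape."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(records, list):
        raise ValueError(f"{path}: expected a top-level JSON array")
    questions = []
    for i, record in enumerate(records):
        question = record.get("question") if isinstance(record, dict) else None
        if not isinstance(question, str) or not question.strip():
            raise ValueError(f"{path}: record {i} needs a non-empty string 'question'")
        questions.append(question)
    return questions
```

Calling `load_questions` on a file under `data/train/` or `data/validation/` returns a plain list of strings, or raises `ValueError` naming the first malformed record.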

## Examples

### Training a Small Model (Demo)
```bash
# Uses DialoGPT-small for quick demonstration
./scripts/train.sh configs/demo_config.yaml
```

### Training a Larger Model
```bash
# Uses Qwen2.5-7B-Instruct (requires ~16GB GPU memory)
./scripts/train.sh configs/qwen_config.yaml
```

### Custom Steering
```bash
# Load trained vectors and test custom prompts
python scripts/run_steering.py \
    --vectors outputs/demo/[experiment]/steering_vectors.pt \
    --vector-idx 5 \
    --coefficient 2.0 \
    --prompt "How should I approach a difficult conversation?" \
    --max-new-tokens 256 \
    --temperature 0.7
```

## Output Structure

Each training run creates an experiment directory containing:
```
outputs/[model_name]/[timestamp_experiment_id]/
├── config.yaml              # Training configuration
├── steering_vectors.pt       # Learned steering vectors
├── training_log.txt         # Training logs
└── validation_generations.jsonl  # Generated text samples
```
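
The layout above can be navigated programmatically. The following sketch uses hypothetical helpers (`latest_experiment`, `read_generations`) and assumes that the newest run is the one with the most recent modification time. The fields inside each JSONL record are not documented here, so records are returned as parsed.

```python
import json
from pathlib import Path


def latest_experiment(model_dir):
    """Return the most recently modified experiment directory that holds steering vectors."""
    candidates = [
        d for d in Path(model_dir).iterdir()
        if d.is_dir() and (d / "steering_vectors.pt").exists()
    ]
    if not candidates:
        raise FileNotFoundError(f"no finished experiments under {model_dir}")
    return max(candidates, key=lambda d: d.stat().st_mtime)


def read_generations(experiment_dir):
    """Parse validation_generations.jsonl into a list, one JSON value per non-empty line."""
    path = Path(experiment_dir) / "validation_generations.jsonl"
    with path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `read_generations(latest_experiment("outputs/demo"))` would load the samples from the newest demo run, assuming at least one run has finished.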
