# Reproducibility Code for Steering Vector Transfer via Orthonormal Transformations

This repository contains the code to reproduce all experiments and figures from our paper on transferring steering vectors across language models.

## Requirements

- Python 3.8+
- numpy>=1.21.0
- scipy>=1.7.0
- matplotlib>=3.5.0
- seaborn>=0.11.0
- tqdm>=4.62.0
- pathlib (part of the Python standard library; no installation needed)

## Installation

```bash
pip install numpy scipy matplotlib seaborn tqdm
```

## Directory Structure

```
reproducibility/
├── alignment/          # Main alignment and transfer experiments
│   ├── run_transfer_experiments.py  # Main transfer pipeline
│   └── run_scrambling.py           # Scrambling hierarchy experiments
├── figures/            # Figure generation scripts
│   └── generate_figures.py         # Generate all paper figures
├── results/            # Output directory for results (created automatically)
└── vectors/           # Input steering vectors (user must provide)
    └── raw/
        └── {model}_{trait}_vectors_webscale.npy
```

## Data Format

The code expects steering vectors as NumPy `.npy` files with the following naming convention:
```
{model}_{trait}_vectors_webscale.npy
```

Where:
- `model` is one of: `gemma`, `llama3`, `mistral`
- `trait` is one of the 26 behavioral traits listed below

### Required Traits

```python
traits = [
    'accessibility', 'assertiveness', 'authority', 'certainty', 'clarity',
    'concreteness', 'creativity', 'directness', 'emotional_tone', 'empathy',
    'enthusiasm', 'formality', 'hedging', 'humor', 'inclusivity',
    'objectivity', 'optimism', 'persuasiveness', 'politeness', 'precision',
    'professionalism', 'register', 'specificity', 'technical_complexity',
    'urgency', 'verbosity'
]
```
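Given the naming convention and trait list above, loading one file could look like the sketch below. The helper names `vector_path` and `load_vectors` are hypothetical and not part of the repository scripts, and the 2-D `(n_vectors, hidden_dim)` array layout is an assumption about the file contents.

```python
from pathlib import Path

import numpy as np

MODELS = ("gemma", "llama3", "mistral")


def vector_path(model: str, trait: str, root: Path = Path("vectors/raw")) -> Path:
    """Build the expected file path for one model/trait pair (hypothetical helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model!r}")
    return root / f"{model}_{trait}_vectors_webscale.npy"


def load_vectors(model: str, trait: str, root: Path = Path("vectors/raw")) -> np.ndarray:
    """Load one file; ASSUMES a 2-D array of shape (n_vectors, hidden_dim)."""
    arr = np.load(vector_path(model, trait, root))
    if arr.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {arr.shape}")
    return arr
```

For example, `load_vectors("gemma", "humor")` would read `vectors/raw/gemma_humor_vectors_webscale.npy` relative to the current working directory.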

## Running Experiments

### 1. Main Transfer Experiments

Run cross-model transfer experiments with 5 random seeds (from the `reproducibility/` root):

```bash
cd alignment/
python run_transfer_experiments.py
```

This will:
- Load steering vectors for all model pairs
- Run PCA + Similarity Procrustes alignment
- Perform 5 runs with different train/test splits
- Output results to `results/transfer_results_5runs.json`

Expected output format (the values shown match the Gemma → LLaMA row under Expected Results):
```
Test Cosine: 0.559 ± 0.008
Scale Factor: 0.727
Train-Test Gap: 0.004
```
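A `mean ± std` line like the one above can be produced from per-seed scores roughly as follows. This is a sketch, not the script's actual reporting code: the function name `summarize` is hypothetical, the input values are made up for illustration, and the use of the population standard deviation (`ddof=0`) is an assumption.

```python
import numpy as np


def summarize(per_run_cosines):
    """Format per-seed test cosines as 'mean ± std' (population std, an assumption)."""
    values = np.asarray(per_run_cosines, dtype=float)
    return f"Test Cosine: {values.mean():.3f} ± {values.std():.3f}"


# Hypothetical per-seed values for illustration only (NOT results from the paper).
print(summarize([0.50, 0.52, 0.51, 0.49, 0.53]))
# prints: Test Cosine: 0.510 ± 0.014
```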

### 2. Scrambling Hierarchy Experiments

Test the importance of semantic pairing (from the `reproducibility/` root):

```bash
cd alignment/
python run_scrambling.py
```

This will test three conditions:
- **Proper pairing**: Instance-level correspondence preserved (expected ~0.530)
- **Within-trait shuffling**: Trait identity preserved, instances shuffled (expected ~0.308)
- **Cross-trait shuffling**: Complete randomization (expected ~0.000)

Output is saved to `results/scrambling_results.json`.
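The three pairing conditions can be illustrated with the sketch below, which re-pairs target rows against fixed source rows. The function `scramble_targets`, its condition names, and the integer `trait_ids` array are hypothetical; `run_scrambling.py` may build the conditions differently.

```python
import numpy as np


def scramble_targets(Y, trait_ids, condition, rng):
    """Return target rows re-paired against the (unchanged) source rows."""
    Y = np.asarray(Y)
    trait_ids = np.asarray(trait_ids)
    if condition == "proper":
        # Instance-level correspondence preserved.
        return Y.copy()
    if condition == "cross_trait":
        # One global permutation: trait identity is not preserved.
        return Y[rng.permutation(len(Y))]
    if condition == "within_trait":
        # Permute rows only among instances that share a trait.
        order = np.arange(len(Y))
        for t in np.unique(trait_ids):
            idx = np.flatnonzero(trait_ids == t)
            order[idx] = rng.permutation(idx)
        return Y[order]
    raise ValueError(f"unknown condition: {condition!r}")
```

Fitting the alignment on each scrambled pairing and scoring it on properly paired test data is one way to obtain a three-level comparison like the one listed above.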

### 3. Generate Figures

Create all paper figures (from the `reproducibility/` root):

```bash
cd figures/
python generate_figures.py
```

Generated figures:
- `fig1_transfer_heatmap.pdf`: Transfer performance matrix
- `fig2_scrambling_hierarchy.pdf`: Semantic pairing ablation
- `fig4_trait_performance.pdf`: Per-trait performance bar chart

## Key Methods

### PCA + Similarity Procrustes Alignment

Our method performs alignment in three steps:

1. **PCA Projection**: Project the vectors onto the top k=1300 principal components
2. **Procrustes Alignment**: Learn an orthogonal transformation R and a scalar scale s on the training pairs
3. **Transfer**: Apply the learned transformation to held-out test vectors
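The three steps can be sketched as below. This is a minimal illustration under stated assumptions, not the code in `run_transfer_experiments.py`: all function names are hypothetical, PCA is assumed to be fit separately per model on mean-centered training vectors, and the comparison is assumed to happen in the target model's PCA space. `scipy.linalg.orthogonal_procrustes` returns the orthogonal matrix together with the sum of singular values of `A.T @ B`, from which the least-squares scale follows.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes


def fit_pca(X, k):
    """Return (mean, components); components has shape (k, d)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:k]


def project(X, mean, components):
    """Center X and project it onto the PCA components."""
    return (X - mean) @ components.T


def fit_similarity_procrustes(A, B):
    """Find orthogonal R and scalar s minimizing ||s * A @ R - B||_F."""
    R, sv_sum = orthogonal_procrustes(A, B)
    s = sv_sum / np.sum(A ** 2)  # closed-form optimal scale for the fitted R
    return R, s


def fit_transfer(X_train, Y_train, k):
    """Fit per-model PCA (an assumption), then R and s in the shared k-dim space."""
    px = fit_pca(X_train, k)
    py = fit_pca(Y_train, k)
    R, s = fit_similarity_procrustes(project(X_train, *px), project(Y_train, *py))
    return px, py, R, s


def transfer(X, px, R, s):
    """Map source vectors into the target model's PCA space."""
    return s * project(X, *px) @ R
```

With k=1300, each training matrix needs at least 1300 rows and 1300 columns for `fit_pca` to return that many components; a smaller `k` is convenient for a quick check on synthetic data.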

### Evaluation Metric

We use cosine similarity between each aligned source vector and its corresponding target vector:

```python
import numpy as np
cos_sim = np.dot(x_aligned, y) / (np.linalg.norm(x_aligned) * np.linalg.norm(y))  # 1-D arrays
```
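For a whole test set, the same metric can be computed row-wise and averaged. The sketch below assumes `X_aligned` and `Y` are 2-D arrays with one vector per row; the name `mean_cosine` and the `eps` guard against zero-norm rows are illustrative additions.

```python
import numpy as np


def mean_cosine(X_aligned, Y, eps=1e-12):
    """Average row-wise cosine similarity between two (n, k) arrays."""
    num = np.sum(X_aligned * Y, axis=1)
    den = np.linalg.norm(X_aligned, axis=1) * np.linalg.norm(Y, axis=1)
    return float(np.mean(num / np.maximum(den, eps)))
```

Because cosine similarity is invariant to positive rescaling of either argument, the learned scale s does not change this metric; only the rotation R does.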

## Expected Results

| Model Pair | Test Cosine | Scale Factor |
|------------|-------------|--------------|
| Gemma → LLaMA | 0.559 | 0.727 |
| Gemma → Mistral | 0.513 | 0.722 |
| LLaMA → Gemma | 0.559 | 1.000 |
| LLaMA → Mistral | 0.516 | 0.841 |
| Mistral → Gemma | 0.513 | 0.926 |
| Mistral → LLaMA | 0.516 | 0.868 |
| **Mean** | **0.529** | **0.847** |

## Hyperparameters

- **k_dims**: 1300 (PCA dimensions)
- **Train/test split**: 80/20
- **Random seeds**: [42, 123, 456, 789, 1011]
- **Max vectors per trait**: 2500 (for memory efficiency)
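One way these settings could combine into a seeded split is sketched below. It assumes the per-trait cap is applied before the 80/20 split and that NumPy's `default_rng` is used; the actual scripts may order these steps or seed differently, and `split_indices` is a hypothetical name.

```python
import numpy as np

SEEDS = [42, 123, 456, 789, 1011]
MAX_PER_TRAIT = 2500
TRAIN_FRACTION = 0.8


def split_indices(n_vectors, seed, max_vectors=MAX_PER_TRAIT, train_fraction=TRAIN_FRACTION):
    """Shuffle indices, cap them at max_vectors, then split into train/test."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_vectors)[:max_vectors]
    n_train = int(round(train_fraction * len(idx)))
    return idx[:n_train], idx[n_train:]
```

Applying the same indices to the source and target arrays of a trait keeps the instance-level pairing intact across the split.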

## Citation

If you use this code, please cite our paper:

```bibtex
@inproceedings{anonymous2026steering,
  title={Steering Vector Transfer via Orthonormal Transformations and Semantic Pairing},
  author={Anonymous},
  booktitle={ICLR},
  year={2026}
}
```

## Contact

For questions about the code, please open an issue in the repository.