# Trait Basis Vectors

This directory contains two helper scripts for (1) computing "trait" steering vectors from conversation logs and (2) using those vectors to steer a chat model at inference time with varying strengths.

## Prerequisites

- Python 3.10+
- GPU with bfloat16 support is strongly recommended; the scripts fall back to CPU but will be slow.
- Install dependencies once via:

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## 1. Prepare Chat Histories

`generate_vector.py` expects a JSON file containing a list of objects with `real` and `contrastive` chat transcripts (see `chat_histories/impatience.json` for an example). Each transcript should alternate user/assistant roles exactly as expected by the Hugging Face chat template.

## 2. Generate Trait Vectors

Run `generate_vector.py` to compute the difference between average real vs. contrastive activations for a given trait:

```bash
python generate_vector.py \
  --chat_history_file chat_histories/impatience.json \
  --trait impatience \
  --model_name meta-llama/Llama-3.1-8B-Instruct \
  --vector_type response
```

Key flags:
- `--chat_history_file`: path to the prepared chat log JSON.
- `--trait`: label used in the saved file name.
- `--model_name`: any HF model ID supported by `AutoModelForCausalLM` (must match what you will use for inference).
- `--vector_type`: `prompt` or `response` segment of each conversation to average.

The script stores the resulting tensor in `activations/diff_vectors_<vector_type>_<trait>_<model>.pt` and prints the per-example cross-entropy losses for sanity checking.

## 3. Run Inference With Steering

Use `inference_vector.py` to inject the saved steering vector into a specific transformer layer while generating responses for different prompts/strengths:

```bash
python inference_vector.py \
  --trait impatience \
  --layer 12 \
  --prompt_type router \
  --strengths 0 1.5 3.0 \
  --model_name meta-llama/Llama-3.1-8B-Instruct
```

Important arguments:
- `--trait`: must match the trait name used in step 2 (the script looks in `activations/`).
- `--layer`: transformer block index where the steering vector is added (0-indexed, no embedding layer).
- `--prompt_type`: selects one of the built-in conversation seeds (`refund`, `router`, `job`, `transformer`). Replace or modify as needed.
- `--strengths`: list of scalars applied to the vector; higher magnitudes enforce the trait more strongly.

For each strength, the script prints the assistant response and writes the collected outputs to `steering_results/<trait>_<layer>_<prompt>.json` for later inspection.

## Tips

- Expect to experiment with both `layer` and `strength` to find a sweet spot; too much steering can derail coherence.
- If you generate vectors for multiple models, keep the `activations/` directory organized per model name (already encoded in the default file naming scheme).
- To steer with a custom prompt, replace `load_prompt()` or pass your own message list to `generate_steered_response` if you integrate the helper into other scripts.
