# 🎭 Persona Vectors + GRPO: Steering Traits While Optimizing Math Correctness

This fork adapts **Persona Vectors** to a new use case: **training with GRPO to optimize correct math** and comparing runs **with and without persona-vector steering**.

The persona-vector pipeline is retained (extract activations with positive/negative persona prompts → compute a vector), but **training is now done via GRPO** (`grpo.py`) instead of the original LoRA SFT training scripts.

---

## 🚀 Quick Start

### ⚙️ Setup

1. Create and activate a virtual environment:
```bash
python -m venv .venv
source .venv/bin/activate
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

3. Configure environment variables:
```bash
cp .env.example .env
# Fill in your API keys (used for judge/eval)
```

---

## 1) 🧬 Generating Persona Vectors

This section generates a persona vector for the trait **`evil`** using **unsloth/Qwen2.5-3B-Instruct**.

### A) Extract activations with positive and negative persona system prompts

**Positive (trait-present) extraction:**
```bash
python -m eval.eval_persona \
    --model unsloth/Qwen2.5-3B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/qwen2_5/evil_pos_instruct.csv \
    --persona_instruction_type pos \
    --assistant_name evil \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract
```

**Negative (trait-absent) extraction:**
```bash
python -m eval.eval_persona \
    --model unsloth/Qwen2.5-3B-Instruct \
    --trait evil \
    --output_path eval_persona_extract/qwen2_5/evil_neg_instruct.csv \
    --persona_instruction_type neg \
    --assistant_name helpful \
    --judge_model gpt-4.1-mini-2025-04-14  \
    --version extract
```

**Assistant name note:** this fork keeps the original convention: we prepend an instruction like  
`"You are a [assistant_name] assistant."`  
- For **positive** prompts, use the trait adjective (e.g., `evil`)  
- For **negative** prompts, use an antonym if clear, otherwise `helpful`

### B) Compute the persona vector

```bash
python generate_vec.py \
    --model_name unsloth/Qwen2.5-3B-Instruct \
    --pos_path eval_persona_extract/qwen2_5/evil_pos_instruct.csv \
    --neg_path eval_persona_extract/qwen2_5/evil_neg_instruct.csv \
    --trait evil \
    --save_dir persona_vectors/qwen2_5
```

This produces persona-vector artifacts in:
- `persona_vectors/qwen2_5/`

---

## 2) 🏋️ GRPO Training (Regular)

Train with GRPO on the objective of **math correctness**, without persona-vector steering:

```bash
python grpo.py grpo.json
```

---

## 3) 🧭 GRPO Training + Persona Vectors (Steered)

Train with GRPO, **with persona-vector steering enabled** (as configured in `grpo_steer.json`):

```bash
python grpo.py grpo_steer.json
```

---

## Notes

- This fork focuses on **GRPO math optimization** and uses persona vectors to compare behavioral/trait steering effects during training.
- Evaluation utilities (e.g., `eval.eval_persona`) still rely on an LLM judge model (configured via `.env` and `--judge_model`).