# SEAT-RL

This repository contains the code and scripts needed to reproduce the experiments in the TMLR submission, which presents a feedback-driven, black-box safety alignment testing framework for LLMs based on deep reinforcement learning. It includes training loops, evaluation and analysis scripts, baselines, and defenses.

## Repository Layout
- `main.py`: PPO training entrypoint
- `evaluation.py`: policy evaluation helper
- `jailbreak_env.py`: prompt-generation environment
- `a2c_ppo_acktr/`: RL utilities and policy models
- `rephrase_baseline.py`: rephrasing baseline
- `random_policy.py`: random-action baseline
- `post_analysis.py`: post-hoc analysis
- `data_utils.py`: dataset loading utilities

## Setup
1. Create a virtual environment and install dependencies:
   ```bash
   python -m venv .venv
   source .venv/bin/activate
   pip install -r requirements.txt
   ```
2. Model access:
   - OpenAI targets: pass `--openai_key` to `main.py` or set `OPENAI_API_KEY`.
   - Local targets (Vicuna/Llama2/Falcon): ensure weights are available via `transformers` or `vllm`.
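
The two ways of supplying an OpenAI key can be pictured with the sketch below. It is only an illustration: `resolve_openai_key` is a hypothetical helper, and the precedence it uses (command-line value first, then `OPENAI_API_KEY`) is an assumption, not a description of what `main.py` does internally.

```python
import os


def resolve_openai_key(cli_key=None, env=None):
    """Pick an OpenAI key from an explicit value or from the environment.

    Hypothetical helper: the precedence (explicit value first, then
    OPENAI_API_KEY) is an assumption, not taken from main.py.
    """
    env = os.environ if env is None else env
    key = cli_key or env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "No OpenAI key found: pass --openai_key or set OPENAI_API_KEY."
        )
    return key
```

Failing early with a clear message avoids starting a long training run that only errors out at the first request to the target model.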

## Data
Place the following files under `data/`; they are read by `jailbreak_env.py` and the baseline scripts:
- `advbench_questions.csv`, `unalign_response_advbench.csv`
- `question_list.csv`, `unalign_response.csv`
- `action2prompt.pkl`, `examples.pkl`
- Optional: `most_harmful_questions.csv`, `most_harmful_unalign_responses.csv`, `templates.csv`
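
Before launching a run, it can help to confirm that the files listed above are in place. The following is a minimal sketch: the file names are copied from the list, while `missing_files` is a hypothetical helper and the flat layout directly under `data/` is an assumption.

```python
from pathlib import Path

# File names copied from the list above; a flat layout under data/ is assumed.
REQUIRED_FILES = [
    "advbench_questions.csv",
    "unalign_response_advbench.csv",
    "question_list.csv",
    "unalign_response.csv",
    "action2prompt.pkl",
    "examples.pkl",
]
OPTIONAL_FILES = [
    "most_harmful_questions.csv",
    "most_harmful_unalign_responses.csv",
    "templates.csv",
]


def missing_files(data_dir, names):
    """Return the entries of `names` that are not regular files under data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]
```

For example, `missing_files("data", REQUIRED_FILES)` returns an empty list when every required file is present, and `missing_files("data", OPTIONAL_FILES)` reports which optional files are absent.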

## Running
### PPO Training
```bash
python main.py --algo ppo --models llama2 --datasets advbench --num-processes 16 --max_steps 5
```
Supported `--models` values include `vicuna`, `vicuna_13b`, `llama2`, `llama2_70b`, `falcon_40b`, and `gpt-*` (OpenAI).
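
To run the same training configuration against several targets, the command above can be assembled programmatically. This is a sketch under stated assumptions: `build_train_command` and `is_supported_model` are hypothetical helpers, the flags are exactly those shown in the example command, and treating the listed model names as the complete set is an assumption (the list may not be exhaustive).

```python
import sys

# Local model names copied from the list above; OpenAI targets are matched
# by the "gpt-" prefix. Treating this as the full set is an assumption.
LOCAL_MODELS = ("vicuna", "vicuna_13b", "llama2", "llama2_70b", "falcon_40b")


def is_supported_model(name):
    """Return True if `name` is a listed local model or starts with 'gpt-'."""
    return name in LOCAL_MODELS or name.startswith("gpt-")


def build_train_command(model, dataset="advbench", num_processes=16, max_steps=5):
    """Build the argument list for one PPO training run of main.py."""
    if not is_supported_model(model):
        raise ValueError(f"unrecognized model name: {model!r}")
    return [
        sys.executable, "main.py",
        "--algo", "ppo",
        "--models", model,
        "--datasets", dataset,
        "--num-processes", str(num_processes),
        "--max_steps", str(max_steps),
    ]
```

Each returned list can be passed to `subprocess.run(..., check=True)` from the repository root, for instance in a loop over `LOCAL_MODELS`.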


### Evaluation
```bash
python main.py --algo ppo --evaluate --ckpt_path path/to/checkpoint.pt --models llama2 --datasets advbench
```

### Baselines and Analysis
- Random policy: `python random_policy.py`
- Rephrase baseline: `python rephrase_baseline.py`
- Post-analysis: `python post_analysis.py --file_path results.csv --target_model meta-llama/Llama-2-7b-chat-hf`

## Outputs
- Training logs: written to `/tmp/gym/` by default
- Checkpoints: `trained_models/` (set via `--save-dir`)
- Result CSVs and summaries: `saved_results/`
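
When evaluating, the checkpoint to pass to `--ckpt_path` usually has to be located under the save directory. The sketch below picks the most recently modified one. `latest_checkpoint` is a hypothetical helper, and both the `.pt` extension (taken from the evaluation example) and the directory structure under `trained_models/` are assumptions.

```python
from pathlib import Path


def latest_checkpoint(save_dir="trained_models", pattern="*.pt"):
    """Return the most recently modified file matching `pattern`, or None.

    The search is recursive because the layout under save_dir is not
    documented here; the ".pt" extension is likewise an assumption.
    """
    root = Path(save_dir)
    if not root.is_dir():
        return None
    candidates = [p for p in root.rglob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path can then be supplied as `--ckpt_path` in the evaluation command.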

## Reproducibility
- Set `--seed` for controlled runs (default: 1). Multi-process and multi-GPU runs are not strictly deterministic.
- For double-blind review, we provide access to this repository and all supplemental artifacts through an anonymized link or package as instructed by the submission guidelines. Public release details (if applicable) will be provided upon acceptance.
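
As a point of reference for the `--seed` option mentioned above, a typical seeding routine for a Python RL codebase looks like the sketch below. `seed_everything` is a hypothetical helper and is not claimed to match how `main.py` applies the seed; NumPy and PyTorch are seeded only if they are installed.

```python
import random


def seed_everything(seed=1):
    """Seed the Python, NumPy, and PyTorch RNGs (the latter two if installed)."""
    random.seed(seed)
    try:
        import numpy as np
    except ImportError:
        pass
    else:
        np.random.seed(seed)
    try:
        import torch
    except ImportError:
        pass
    else:
        torch.manual_seed(seed)
        # Safe to call on CPU-only machines; the call is deferred until CUDA is used.
        torch.cuda.manual_seed_all(seed)
    return seed
```

Seeding of this kind fixes the random streams within a single process, but it does not remove nondeterminism from parallel environment workers or from some GPU kernels, which is consistent with the caveat about multi-process and multi-GPU runs.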

