# Multi-turn Jailbreak (Unified Feedback Pipeline)

## Setup

### Quick setup

```sh
bash setup.sh
```

### Environment variables
Required by the current scripts:

```sh
export OPENAI_API_KEY=your_openai_api_key_here
export HF_TOKEN=your_huggingface_token_here
export HF_HOME=$SCRATCH/LLMs
```

Optional (only if using AWS / Bedrock backends):

```sh
export AWS_ACCESS_KEY_ID=your_aws_access_key_id
export AWS_SECRET_ACCESS_KEY=your_aws_secret_access_key
export AWS_DEFAULT_REGION=us-east-1
```

Persist them in your shell rc:

```sh
echo 'export OPENAI_API_KEY=your_openai_api_key_here' >> ~/.bashrc
echo 'export HF_TOKEN=your_huggingface_token_here' >> ~/.bashrc
echo 'export HF_HOME=$SCRATCH/LLMs' >> ~/.bashrc
source ~/.bashrc
```
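
A missing variable otherwise tends to surface only when a script first calls the API or downloads a model. The sketch below checks the three required variables up front. `check_required_env` is a hypothetical helper, not part of this repository.

```sh
# Hypothetical helper (not part of this repository): report every required
# variable that is unset or empty, and return non-zero if any is missing.
check_required_env() {
  missing=0
  for name in OPENAI_API_KEY HF_TOKEN HF_HOME; do
    # Indirect lookup via eval keeps this portable to plain POSIX sh.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_required_env || exit 1` at the top of a job script stops the run before any server or experiment is started.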

## Environments

### `sglang-env` (for running the SGLang server)

```sh
conda create -n sglang-env python=3.12
conda activate sglang-env
pip install --upgrade pip
pip install flashinfer-python
pip install uv
uv pip install sglang --prerelease=allow
```

### `multi-turn-new` (for running experiments)

```sh
# Option 1: Let conda choose location automatically
conda env create -f environment.yml
conda activate multi-turn-new

# Option 2: Specify custom location using $SCRATCH
conda env create -f environment.yml --prefix "$SCRATCH/miniconda3/envs/multi-turn-new"
conda activate "$SCRATCH/miniconda3/envs/multi-turn-new"
```

## Local servers

### SGLang (OpenAI-compatible server for attacker/local models)
Default port is `30000`.

Run in background (recommended):

```sh
export HF_HOME=$SCRATCH/LLMs
bash deploy_sglang_background.sh
```

Check status:

```sh
curl http://localhost:30000/health
tail -f logs/sglang_server.log
```
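
A server started in the background may not answer on `/health` right away, so a script that depends on it can poll instead of checking once. This is a minimal sketch: `wait_for_health` is a hypothetical helper, and the 300-second default and 5-second interval are assumptions, not values taken from the deploy scripts.

```sh
# Hypothetical helper: poll a health URL until it responds or a timeout passes.
wait_for_health() {
  url=$1
  timeout=${2:-300}   # seconds; assumed default, adjust for your model size
  elapsed=0
  # -s: quiet, -f: treat HTTP errors as failure, --max-time: bound each probe.
  until curl -sf --max-time 5 -o /dev/null "$url"; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "server at $url not healthy after ${timeout}s" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
}
```

For example, `wait_for_health http://localhost:30000/health && bash run_full_pipeline.sh --no-server --num-behaviors 1` would start the run only once the server responds.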

Stop:

```sh
pkill -f "sglang.launch_server.*30000" || true
```

Run in foreground (blocks terminal):

```sh
export HF_HOME=$SCRATCH/LLMs
bash deploy_sglang.sh
```

### vLLM rewrite server (optional; used by mislead defense in server mode)
The default port is `8000`; the served model name defaults to `increase`.

```sh
bash deploy_vllm_background.sh --server-port 8000 --served-model-name increase
curl http://localhost:8000/health
tail -f logs/vllm_server.log
```

### vLLM guard server (optional; used by Guard defense)
The default port is `30001`; the served model name defaults to `guard`.

```sh
bash deploy_vllm_guard_background.sh --server-port 30001 --served-model-name guard
curl http://localhost:30001/health
tail -f logs/vllm_guard_server.log
```

## Run experiments

### Complete pipeline (recommended)
`run_full_pipeline.sh` checks for the required servers (starting them if needed), runs the experiments, and optionally produces a summary.

Quick test (recommended; runs 1 behavior):

```sh
bash run_full_pipeline.sh --num-behaviors 1
```

Notes about current defaults (see `bash run_full_pipeline.sh --help` for the authoritative list):
- Starts SGLang on `localhost:30000` unless you pass `--server-node` (remote) or `--no-server`.
- Default `--plan-generator` is `crescendo`.
- If `--num-behaviors` is omitted, the script processes **ALL** behaviors in the dataset.
- `--force` is **disabled** by default (existing `outputs/<name>/behavior_*.json` will be skipped).
- The mislead and Guard defenses are both **disabled** by default.

Common examples:

```sh
# Use a different plan generator
bash run_full_pipeline.sh --plan-generator xteaming --num-behaviors 10

# Explicit updater (plan generator and updater can be different)
bash run_full_pipeline.sh --plan-generator crescendo --updater xteaming --num-behaviors 10

# Re-run from scratch (delete outputs/<name>/ first)
bash run_full_pipeline.sh --num-behaviors 10 --force

# Enable defenses
bash run_full_pipeline.sh --num-behaviors 10 --enable-mislead
bash run_full_pipeline.sh --num-behaviors 10 --enable-proact
bash run_full_pipeline.sh --num-behaviors 10 --enable-guard
```

### Using remote servers
If the SGLang (OpenAI-compatible) server is running on a different node, pass that node's hostname with `--server-node`:

```sh
bash run_full_pipeline.sh --server-node nid001040 --num-behaviors 10
```

When `--server-node` is anything other than `localhost` or `127.0.0.1`, the script does **not** start a local server; it only checks that the remote server is reachable.
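
The local-versus-remote decision can be sketched as below. This illustrates the rule stated here and is not the script's actual implementation; `is_local_node` is a hypothetical name.

```sh
# Hypothetical helper: succeed only for the two names treated as local.
is_local_node() {
  case "$1" in
    localhost|127.0.0.1) return 0 ;;
    *) return 1 ;;
  esac
}

# Sketch of how a wrapper might branch on the result.
describe_server_mode() {
  if is_local_node "$1"; then
    echo "local: start the server if it is not running"
  else
    echo "remote: only check that $1 is reachable"
  fi
}
```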

If your configs need explicit base URLs, set them as follows:

```yaml
target_model:
  base_url: "http://nid001040:30000/v1"
attacker_model:
  base_url: "http://nid001040:30000/v1"
```
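
The base URL has the form `http://<node>:<port>/v1`. A small helper can build it from the node name to avoid typos. `openai_base_url` is a hypothetical name, and the commented probe assumes the server exposes the usual OpenAI-style `/v1/models` route.

```sh
# Hypothetical helper: build the OpenAI-compatible base URL for a node.
openai_base_url() {
  node=$1
  port=${2:-30000}   # SGLang default port used in this README
  echo "http://${node}:${port}/v1"
}

# Example probe (assumes a /v1/models route is available on the server):
# curl -sf "$(openai_base_url nid001040)/models"
```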

### Results
Outputs are written under `outputs/<output_name>/` as `behavior_<idx>.json`.

```sh
ls -lh outputs/
python analysis.py
```
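
Existing `behavior_<idx>.json` files are skipped unless `--force` is passed, so it can help to count how many a run has already produced. This is a minimal sketch: `count_behaviors` is a hypothetical helper, and `TestRun` in the usage note is only an example output name.

```sh
# Hypothetical helper: count behavior_<idx>.json files in an output directory.
count_behaviors() {
  count=0
  for f in "$1"/behavior_*.json; do
    # An unmatched glob stays literal, so confirm the file really exists.
    if [ -e "$f" ]; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

For example, `count_behaviors outputs/TestRun` prints the number of completed behaviors for that run.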

## Manual run (advanced)

```sh
conda activate multi-turn-new
python main.py \
  --plan_generator xteaming \
  --updater xteaming \
  --number_of_behaviors 10 \
  --output_name TestRun
```

**Available plan generators / updaters:** `xteaming`, `actorattack`, `coa`, `crescendo`, `fitd`

## Results figure
![Results of Unified Feedback](./analysis_summary.png)
