# Existing Adversarial Large Language Model Unlearning Evaluations Are Inconclusive

## Environment

Two working options (pick one):

1) General benchmarking: `src/env.yml`

```
conda env create -f src/env.yml
conda activate unlearning
```

2) Enhanced GCG/FLRT extras: `src/enhanced_gcg/env.yml`

```
conda env create -f src/enhanced_gcg/env.yml
conda activate flrt
```

Finetuning has its own minimal env: `src/finetuning/env.yml`.

> Note: the repo’s cleaning script strips any conda `prefix:` lines to keep paths anonymous.

## Where Things Live

- MCQ benchmarking (accuracy): `src/benchmarking/benchmark.py`
- Perplexity evaluation: `src/benchmarking/perp_eval.py`
- Finetuning ("relearning attack"): `src/finetuning/finetune.py`
- Prompt optimization helpers: `prompt_optimization/` and `src/enhanced_gcg/`
- Simple ACR-style run script: `acr-wmdp.py` with config in `config/config.yaml`
- Ready-to-run examples: `scripts/benchmark_mcq.sh`, `scripts/benchmark_perp.sh`, `scripts/relearn_attack.sh`

## MCQ Benchmark (WMDP)

Basic run (applies chat template by default):

```
python -m src.benchmarking.benchmark \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --tasks wmdp-bio
```

With an adversarial init (defined in `benchmark.py` via `--adv_prefix`):

```
python -m src.benchmarking.benchmark \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --tasks wmdp-bio \
  --adv_prefix llama3-1b-3
```

Common flags:
- `--tasks`: comma‑sep subset of `{wmdp-bio,wmdp-chem,wmdp-cyber,mmlu,tofu-qa}`
- `--ignore_chat_template`: disable chat template
- `--tokenizer_name_or_path`: override tokenizer (e.g. `HuggingFaceH4/zephyr-7b-beta`)

Outputs land under `results/baselines/<model>/results.jsonl`.

## Perplexity Evaluation (MCQ log‑loss)

```
python -m src.benchmarking.perp_eval \
  --model_name_or_path meta-llama/Llama-3.2-1B-Instruct \
  --tokenizer HuggingFaceH4/zephyr-7b-beta \
  --tasks wmdp-bio \
  --batch_size 1 \
  --append_options
```

Supported `--tasks`: `wmdp-bio`, `wmdp-chem`, `wmdp-cyber`, `gpqa-all`, `tofu`.
Results are saved to `results/perp_eval/<model>_<task>.jsonl` and the accuracy is logged to W&B.

## Finetuning (“Relearning Attack”)

Run a small LoRA finetune and evaluate:

```
python -m src.finetuning.finetune \
  --model J4Q8/zephyr-npo-bio \
  --tokenizer HuggingFaceH4/zephyr-7b-beta \
  --dataset wmdp_bio-retain-corpus-mc \
  --n_samples 10 \
  --epochs 3 \
  --lora_rank 128 --lora_alpha 16 \
  --lr 2e-4 --weight_decay 0.01 \
  --batch_size 1 \
  --eval_dataset wmdp-bio
```

Datasets are defined in `src/util/data.py` (see `DATASET_REGISTRY` in `src/finetuning/finetune.py` for options).
Convenience SLURM script: `scripts/relearn_attack.sh`.

## Simple ACR Runner

You can also run the minimal driver used for small ACR‑style sweeps:

```
python acr-wmdp.py
```

Configuration lives in `config/config.yaml` (model, dataset, style, steps, etc.).
Results are written to `results/<model>_<dataset>_<style>_<start>_<end>.json`.

## Notes

- All datasets are loaded via Hugging Face (`cais/wmdp`, `zekeZZ/gpqa_all`, `zekeZZ/tofu_wiki_qa_shuffled`).
- For Zephyr‑based models, it’s often useful to set `--tokenizer_name_or_path HuggingFaceH4/zephyr-7b-beta`.
- The `scripts/` directory includes SLURM directives; if you’re not on SLURM, you can copy the core python lines.
