# PRISM

## Installation

To create the conda environment used in our experiments (Python 3.10), run:
```bash
conda env create -f environment.yml
conda activate muse
```

## Get the data & original models

- The two corpora, `News` and `Books`, and their associated target models are available from `MUSE` on Hugging Face:

    | Corpus | Target model | Dataset |
    |--------|--------------|---------|
    | News | [Target model](https://huggingface.co/muse-bench/MUSE-News_target) | [Dataset](https://huggingface.co/datasets/muse-bench/MUSE-News) |
    | Books | [Target model](https://huggingface.co/muse-bench/MUSE-Books_target) | [Dataset](https://huggingface.co/datasets/muse-bench/MUSE-Books) |

- The WMDP dataset used in our experiments focuses on conversational dialogues. It is derived from the original benchmark and further constructed by generating and paraphrasing question–answer pairs with gpt-4o-mini, referencing the original WMDP QA datasets.
- Before proceeding, download all the data from Hugging Face into the repository root by running:
    ```bash
    python load_data.py
    ```

## Get the unlearned model
1. Run `run_muse_unlearn.sh` in the `baselines` folder. The relevant parameters are:
    - `algo`: The unlearning algorithm to use (e.g., `prism_npo_gdr`).
    - `model_dir`: Path to the target model directory.
    - `tokenizer_dir`: Path to the tokenizer directory.
    - `data_file`: The forget set used for unlearning.
    - `retain_data_file`: The retain set used for GDR regularization, if the algorithm requires it.
    - `out_dir`: Directory to save the unlearned model (default: `ckpt`).
    - `max_len`: Maximum input sequence length (default: 4096).
    - `per_device_batch_size`, `epochs`, `lr`, `pretrained_probe_path`, `adv_gamma`, `select_layer`: Hyperparameters for controlling the training process and model behavior.

## Get the relearned model
- Run `run_muse_relearn.sh` in the `baselines` folder.


## Evaluate the unlearned model

- To evaluate MUSE unlearned model(s), run `eval.py` from the root of this repository with the following command-line arguments:

- `--model_dirs`: A list of folders containing your unlearned models; each entry may be a HuggingFace repo path or a local directory.
- `--names`: One unique label per entry in `--model_dirs`; the number of names must equal the number of model directories.
- `--corpus`: The evaluation corpus to use; choose either `news` or `books`.
- `--out_file`: Filename for the CSV output. Each row represents one unlearning method from `--model_dirs`, and columns correspond to the metrics chosen via `--metrics`.
- `--tokenizer_dir` (Optional): Path to the tokenizer. Defaults to `meta-llama/Llama-2-7b-hf` (the default tokenizer for LLaMA).
- `--metrics` (Optional): Metrics to compute. Options: `verbmem_f` (VerbMem Forget), `privleak` (PrivLeak), `knowmem_f` (KnowMem Forget), `knowmem_r` (KnowMem Retain, i.e., Utility). By default, all are evaluated.
- `--temp_dir` (Optional): Directory for intermediate files. Default: `temp`.
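
The constraints described above (one unique name per model directory, a fixed set of corpora and metrics, and the stated defaults) can be sketched with `argparse`. This is only an illustration of the documented interface, not the actual contents of `eval.py`; the function names `build_parser` and `parse_eval_args` and the constant `ALL_METRICS` are hypothetical.

```python
import argparse

# Metric names as listed in this README.
ALL_METRICS = ["verbmem_f", "privleak", "knowmem_f", "knowmem_r"]


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical re-creation of the documented flags; the real eval.py may differ.
    parser = argparse.ArgumentParser(description="Sketch of the eval.py interface")
    parser.add_argument("--model_dirs", nargs="+", required=True)
    parser.add_argument("--names", nargs="+", required=True)
    parser.add_argument("--corpus", choices=["news", "books"], required=True)
    parser.add_argument("--out_file", required=True)
    parser.add_argument("--tokenizer_dir", default="meta-llama/Llama-2-7b-hf")
    parser.add_argument("--metrics", nargs="+", choices=ALL_METRICS, default=ALL_METRICS)
    parser.add_argument("--temp_dir", default="temp")
    return parser


def parse_eval_args(argv):
    """Parse the arguments and enforce the name/model pairing rules."""
    args = build_parser().parse_args(argv)
    if len(args.names) != len(args.model_dirs):
        raise ValueError(
            f"got {len(args.names)} names for {len(args.model_dirs)} model directories"
        )
    if len(set(args.names)) != len(args.names):
        raise ValueError("each entry in --names must be unique")
    return args


if __name__ == "__main__":
    demo = parse_eval_args(
        ["--model_dirs", "repo/model1", "repo/model2",
         "--names", "model1", "model2",
         "--corpus", "books",
         "--out_file", "out.csv"]
    )
    print(demo.metrics, demo.temp_dir)
```

Checking the pairing up front means a mismatch is reported before any model is loaded.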

- Example command with placeholder values:

    ```bash
    python eval.py \
    --model_dirs "repo/model1" "repo/model2" \
    --names "model1" "model2" \
    --corpus books \
    --out_file "out.csv"
    ```
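
Once the run finishes, the CSV written to `--out_file` can be inspected with the standard library. The sketch below assumes the first column holds the model label and the remaining columns hold numeric metric values; the exact header names written by `eval.py` are not documented here, so the column name `name` and the numbers in the demo are made-up placeholders, and `read_results` is a hypothetical helper.

```python
import csv
import io


def read_results(handle):
    """Return {label: {metric: value}} from an open CSV file handle.

    Assumes one header row, the label in the first column, and numeric
    metric values in every other column.
    """
    reader = csv.reader(handle)
    header = next(reader)
    metric_cols = header[1:]
    results = {}
    for row in reader:
        if not row:  # skip blank lines
            continue
        results[row[0]] = {m: float(v) for m, v in zip(metric_cols, row[1:])}
    return results


if __name__ == "__main__":
    # Placeholder content standing in for a real out.csv; values are invented.
    sample = io.StringIO(
        "name,verbmem_f,knowmem_r\n"
        "model1,1.5,40.0\n"
        "model2,2.0,38.5\n"
    )
    for label, metrics in read_results(sample).items():
        print(label, metrics)
```

For a real run, replace the `io.StringIO` object with `open("out.csv", newline="")`.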

- To evaluate WMDP unlearned model(s), run the following command from `lm-evaluation-harness`:
    ```bash
    lm_eval --model hf \
    --model_args pretrained=wmdp \
    --tasks hellaswag \
    --device cuda:0 \
    --batch_size 8
    ```

- `--model`: Backend used to load the model. Use `hf` to evaluate a Hugging Face Transformers model.
- `--model_args`: Initialization arguments for the backend. For example, `pretrained=wmdp` loads a model named `wmdp` (HF repo ID or local path). Multiple arguments are comma-separated, e.g., `pretrained=meta-llama/Llama-2-7b-hf,revision=main`.
- `--tasks`: Benchmark tasks to run. A single task looks like `hellaswag`; multiple tasks are comma-separated with no spaces, e.g., `mmlu,hellaswag,wmdp`.
- `--device`: Compute device, e.g., `cuda:0` (first GPU) or `cpu`.
- `--batch_size`: Per-device evaluation batch size; larger values are faster but require more memory.
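
When evaluating several checkpoints or task sets, it can help to assemble the `lm_eval` command programmatically so that the comma-separated formats for `--model_args` and `--tasks` are always produced correctly. The following is a minimal sketch that uses only the flags listed above; the helper name `build_lm_eval_command` is hypothetical and not part of `lm-evaluation-harness`.

```python
import shlex


def build_lm_eval_command(model_args, tasks, device="cuda:0", batch_size=8):
    """Build an lm_eval argument list from the flags documented above.

    model_args: dict of backend arguments, e.g. {"pretrained": "wmdp"}
    tasks: list of task names, joined with commas and no spaces
    """
    model_args_str = ",".join(f"{key}={value}" for key, value in model_args.items())
    tasks_str = ",".join(tasks)
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", model_args_str,
        "--tasks", tasks_str,
        "--device", device,
        "--batch_size", str(batch_size),
    ]


if __name__ == "__main__":
    cmd = build_lm_eval_command({"pretrained": "wmdp"}, ["hellaswag"])
    # Print a shell-ready string; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Returning a list rather than a single string lets the command be passed directly to `subprocess.run(cmd, check=True)` without shell quoting concerns.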

## Jailbreak Evaluation

In the `jailbreak/` directory, run the following two scripts in sequence on the provided test cases:
- First, run `gen_response.py` to generate the model responses for the provided jailbreak test cases.
- Then, run `eval.py` to evaluate those responses and produce the corresponding metrics.