# Unlearning MSc Project


## Instructions

The workflow is: run `train.py` to unlearn, after setting the dataset you want inside the file. The results, including a checkpoint for each epoch (10 epochs by default), are written to the results folder. To evaluate the run, use `evaluate.py` to calculate the metrics.


## Project Structure

`egu/trainers` => the different training methods, each inheriting from the `transformers` `Trainer` class

`egu/models` => model loaders for the different PEFT strategies

`egu/evaluators` => helpers and utilities for running evaluation; currently unused, because evaluation is hardcoded for now to get things working

`egu/dataset` => holds the special `data collator` and the `dataset loader`; see "Notes on Dataloaders and Dataset" below

`egu/utils` => more generic helper and calculation functions

### Unlearn


`unlearn.py` => trains the model to unlearn the concept; currently only supports TOFU

### Evaluation

`evaluate.py` => computes BLEU, accuracy, ROUGE, and recall after unlearning; we want these scores to go down

### Notes on Dataloaders and Dataset

The dataloader is more complicated than usual. Instead of each item holding a single data point, such as (question + answer), every item holds a pair drawn from both the forget and retain sets: ((retain: question + answer), (forget: question + answer)). This is for the convenience of the trainers, since some of them use both sets. DPO and KTO additionally use what is called an "I don't know" dataset, which consists of fixed phrases such as "I am not permitted to answer this". The dataloader simply selects one at random from the list of phrases; for more details, check `egu/dataset/idontknow.jsonl`.
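The pairing described above can be sketched roughly as follows. This is a minimal illustration, not the code in `egu/dataset`: the class name `ForgetRetainDataset`, the `question`/`answer` keys, the `idk` key, and the way retain items are cycled to match the forget set are all assumptions made for the example.

```python
import random
from typing import Dict, List, Optional

Example = Dict[str, str]  # assumed shape: {"question": ..., "answer": ...}


class ForgetRetainDataset:
    """Hypothetical sketch: each item carries a forget and a retain example."""

    def __init__(
        self,
        forget: List[Example],
        retain: List[Example],
        idk_phrases: Optional[List[str]] = None,
        seed: int = 0,
    ) -> None:
        self.forget = forget
        self.retain = retain
        self.idk_phrases = idk_phrases
        self.rng = random.Random(seed)

    def __len__(self) -> int:
        # Assumption: one item per forget example.
        return len(self.forget)

    def __getitem__(self, idx: int) -> Dict[str, Example]:
        forget_item = self.forget[idx]
        # Assumption: cycle through the retain set so every forget
        # example gets a retain partner.
        retain_item = self.retain[idx % len(self.retain)]
        item = {"forget": forget_item, "retain": retain_item}
        if self.idk_phrases:
            # For DPO/KTO-style trainers: pair the forget question with a
            # randomly chosen fixed "I don't know" phrase.
            item["idk"] = {
                "question": forget_item["question"],
                "answer": self.rng.choice(self.idk_phrases),
            }
        return item
```

A trainer that only needs the forget set can read `item["forget"]` and ignore the rest, which is the convenience the paired layout is meant to give.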

## Useful commands

### List disk usage by directory, sorted by size

`du -h | sort -h`

### Project directory

Directory: /cs/student/projects3/csml/2024/pchaiyap

### Bash + Training script

Remember to set your Hugging Face token so that the epoch checkpoints can be saved

`export HUGGINGFACE_HUB_TOKEN=`

Select the GPU to use

`export CUDA_VISIBLE_DEVICES=0`

Assuming you have already set up your `accelerate` config, you can run the command below. If not, set it up with the [accelerate CLI](https://huggingface.co/docs/accelerate/en/package_reference/cli).

`accelerate launch train.py`

## Metrics

- BLEU
- ROUGE
- ACCURACY (check softmax of next token)
- RECALL
- PPL
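As a rough illustration of the accuracy and PPL entries, the sketch below computes next-token accuracy (the most probable token under the softmax matches the target) and perplexity (the exponential of the mean negative log-likelihood) from raw logits. It is plain Python for readability and is not the project's evaluation code; the function names and the list-of-lists input format are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_accuracy(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """Fraction of positions where the most probable token is the target."""
    correct = 0
    for row, target in zip(logits, targets):
        probs = softmax(row)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == target)
    return correct / len(targets)


def perplexity(
    logits: Sequence[Sequence[float]], targets: Sequence[int]
) -> float:
    """exp of the mean negative log-likelihood of the target tokens."""
    nll = [-math.log(softmax(row)[t]) for row, t in zip(logits, targets)]
    return math.exp(sum(nll) / len(nll))
```

On the forget set, lower accuracy and higher perplexity after unlearning would both indicate that the model has become less confident in the original answers.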

### Evaluation

#### Local Directory

`python eval.py --model_id results/8_bit_npo_tofu_llama-2-7b_lora_tofu/forget10/epoch-9`

#### Huggingface

`python eval.py --model_id your-org/tofu-llama2-7b-npo-forget10 --dtype bf16`


#### Evaluation with LoRA Adapter


`python eval.py --model_id open-unlearning/tofu_Llama-2-7b-chat-hf_full --adapter_id your-org/tofu-llama2-7b-npo-forget10-adapters --dtype fp16`
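For reference, loading a base model and attaching a LoRA adapter typically looks like the sketch below with `transformers` and `peft`. How `eval.py` actually handles `--model_id`, `--adapter_id`, and `--dtype` is not shown here, so the function names and the dtype mapping are assumptions; only the two dtype names that appear in the commands above are covered.

```python
from typing import Optional

# Assumed mapping from the short command-line names to torch dtype names.
DTYPE_NAMES = {"bf16": "bfloat16", "fp16": "float16"}


def resolve_dtype_name(dtype: str) -> str:
    """Translate a short dtype flag into the matching torch attribute name."""
    try:
        return DTYPE_NAMES[dtype]
    except KeyError:
        raise ValueError(f"Unsupported dtype: {dtype!r}") from None


def load_model(model_id: str, adapter_id: Optional[str] = None, dtype: str = "fp16"):
    """Hypothetical loader: base model first, then the optional LoRA adapter."""
    # Heavy dependencies are imported here so the helper above can be used
    # without them installed.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=getattr(torch, resolve_dtype_name(dtype))
    )
    if adapter_id is not None:
        # Wrap the base model with the adapter weights.
        model = PeftModel.from_pretrained(model, adapter_id)
    return model
```

Leaving `adapter_id` as `None` corresponds to the local-directory and Hugging Face cases above, where the checkpoint is loaded directly.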

### Remote Caveat

1. Remove `vastai` from the requirements, since it locks dependencies
2. Run `pip install mpi4py openmpi`; `openmpi` is what makes it compatible

### Note to self

- The retain set baseline comes from the FT results in the NPO repo.

- On vast.ai, if installation fails, run:

`pip install --force-reinstall torchvision --index-url https://download.pytorch.org/whl/cu121`





