# Data Attribution Driven LLM Unlearning

## Preliminaries

Run `bash set_env.sh`, or set up the environment manually:

```
conda create -n unlearn python=3.10
conda activate unlearn
conda install pytorch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 pytorch-cuda=12.4 -c pytorch -c nvidia
conda install -c "nvidia/label/cuda-12.4.0" cuda-toolkit nccl
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```

## Evaluation Pipeline

If you have multiple GPUs, specify which devices to use via `CUDA_VISIBLE_DEVICES` and set `--nproc_per_node` to the number of devices:

```
# Multi-GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 xxx.py

# Single-GPU
CUDA_VISIBLE_DEVICES=0 python xxx.py
```
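To keep `--nproc_per_node` in sync with the device list, the process count can be derived from `CUDA_VISIBLE_DEVICES` instead of being typed twice. The following is a minimal sketch; `count_devices` is a hypothetical helper and is not part of this repository, and `xxx.py` stands for whichever script you launch.

```shell
# Hypothetical helper (not part of this repository): count the entries in a
# comma-separated device list such as "0,1,2,3".
count_devices() {
  # Put each device ID on its own line and count the non-empty lines.
  # "|| true" keeps the exit status at 0 when the list is empty.
  echo "$1" | tr ',' '\n' | grep -c . || true
}

export CUDA_VISIBLE_DEVICES=0,1,2,3
NPROC=$(count_devices "$CUDA_VISIBLE_DEVICES")
echo "Launching with $NPROC process(es)"

# Then launch as usual, for example:
# torchrun --nproc_per_node="$NPROC" xxx.py
```

With this, changing the device list is the only edit needed when switching between single-GPU and multi-GPU runs.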

### Finetune:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 finetune.py --config-name=finetune.yaml split=full batch_size=1 gradient_accumulation_steps=1 model_family=phi lr=1e-5 num_epochs=1
```

### Data attribution:

Specify your `model_path`:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 data_attribution.py --config-name=data_attribution.yaml split=forget01 model_family=phi unify_method=exp attribution_method=g_norm model_path=... 
```

### Unlearning the Finetuned Model:

Specify your `model_path` and `score_dict_path`:

```
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 forget.py --config-name=forget.yaml split=forget01 batch_size=1 gradient_accumulation_steps=1 model_family=phi lr=1e-5 model_path=... score_dict_path=... 

# Or use pipeline parallelism
python forget.py --config-name=forget.yaml split=forget01 batch_size=1 gradient_accumulation_steps=1 model_family=phi lr=1e-5 model_path=... score_dict_path=... +strategy=pipeline
```
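Because a multi-GPU launch can take a while to fail on a bad argument, it may help to validate both paths before starting. The sketch below assumes `model_path` and `score_dict_path` are local filesystem paths. `require_path`, `MODEL_PATH`, and `SCORE_DICT_PATH` are hypothetical names introduced for illustration and are not defined by this repository.

```shell
# Hypothetical helper (not part of this repository): fail early when a
# required path is empty or does not exist on disk.
require_path() {
  name="$1"
  path="$2"
  if [ -z "$path" ]; then
    echo "error: $name is not set" >&2
    return 1
  fi
  if [ ! -e "$path" ]; then
    echo "error: $name does not exist: $path" >&2
    return 1
  fi
  return 0
}

# Usage (assumes you exported MODEL_PATH and SCORE_DICT_PATH yourself):
# require_path model_path "$MODEL_PATH" \
#   && require_path score_dict_path "$SCORE_DICT_PATH" \
#   && python forget.py --config-name=forget.yaml split=forget01 \
#        model_family=phi model_path="$MODEL_PATH" score_dict_path="$SCORE_DICT_PATH"
```

If `model_path` refers to something other than a local path, skip the existence check for it.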

### Evaluation

Specify the `model_path` of your unlearned model:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nproc_per_node=4 evaluate_all.py model_family=phi model_path=... ckpt_result=...
```

### Report ROUGE:

```
CUDA_VISIBLE_DEVICES=0 python get_rouge.py file_path=...
```