# Unlearning via Domain-wise Subspace Erasure and Explicit Representation Alignment

## Supplementary Materials: Source Code Implementation of ERASER


## Overview

**ERASER** is our proposed method for unlearning specific knowledge from large language models (LLMs). The core implementation is located in `eraser/eraser.py`. This script performs targeted forgetting by modifying internal representations of selected layers.

## How It Works

To run ERASER, execute the following command:

```bash
python -m eraser.eraser [arguments]
```

The script supports a wide range of arguments to control model input, unlearning targets, training configuration, and advanced options like LoRA and subspace steering.
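As a rough illustration, a run that overrides a few defaults could be assembled from Python as shown below. The flag names are the ones documented in this README; the helper `build_eraser_command` and the chosen values are hypothetical and only meant as a sketch.

```python
import shlex
import sys


def build_eraser_command(overrides):
    """Build the argv list for `python -m eraser.eraser` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "eraser.eraser"]
    for flag, value in overrides.items():
        cmd.append(f"--{flag}")
        # Value-less flags such as --verbose are represented with None.
        if value is not None:
            cmd.append(str(value))
    return cmd


cmd = build_eraser_command({
    "forget_corpora": "bio-forget-corpus,cyber-forget-corpus",
    "retain_corpora": "wikitext,wikitext",
    "alpha": "1200,1200",
    "layer_ids": "5,6,7",
    "layer_id": 7,
    "verbose": None,
})

# Print the shell-quoted command for inspection.
print(shlex.join(cmd))
# To actually launch the run: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems with comma-separated values.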

## Argument Descriptions
```text
--model_name_or_path (str): Path to the pretrained model (local directory or Hugging Face ID).
Default: "/data/models_ckpt/zephyr-7b-beta"

--module_str (str): Template string to specify the exact layer to modify.
Default: "{model_name}.base_model.model.model.layers[{layer_id}]"

--output_dir (str): Directory to save the modified model.
Default: "/llm_unlearning/wmdp/models/unlearned_model"

--retain_corpora (str): Comma-separated names of corpora to retain.
Default: "wikitext,wikitext"

--forget_corpora (str): Comma-separated names of corpora to forget.
Default: "bio-forget-corpus,cyber-forget-corpus"

--alpha (str): Comma-separated loss weights, one per forget domain.
Default: "1200,1200"

--lr (float): Learning rate during unlearning.
Default: 5e-5

--min_len / --max_len (int): Min/Max sequence length for inputs.
Default: 50 / 2000

--batch_size (int): Number of samples per batch.
Default: 4

--max_num_batches (int): Maximum number of batches to process.
Default: 500

--layer_id (int): Single layer to apply unlearning.
Default: 7

--layer_ids (str): Comma-separated list of layers to modify.
Default: "5,6,7"

--epoch (int): Number of training epochs.
Default: 1

--seed (int): Random seed for reproducibility.
Default: 42

--verbose: Enable detailed logging.

--lora_layer_selection (str): Comma-separated names of the modules that receive LoRA adapters.
Default: "gate_proj,up_proj,down_proj"

--r (int): LoRA rank.
Default: 256

--lora_alpha (int): LoRA scaling factor.
Default: 512

--use_BCE_loss (bool): Whether to use binary cross-entropy loss.
Default: True

--use_lora (bool): Whether to apply LoRA.
Default: True

--use_TMP_INPUTS (bool): Use temporary variant of inputs (experimental).
Default: False

--lux_weight (float): Weight for auxiliary loss (e.g., disentanglement).
Default: 1.05

--num_PCs (int): Number of principal components used in subspace.
Default: 20

--num_samples_for_pcs (int): Number of samples to compute PCs.
Default: 100

--method_for_pcs (str): Method to compute PCs.
Default: "power_iteration"
```
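Several arguments are comma-separated strings with one entry per forget domain. The sketch below shows one way such values could be split and paired. It assumes that the i-th entries of `--forget_corpora`, `--retain_corpora`, and `--alpha` belong together, which is inferred from the defaults and not confirmed by the script; `split_csv` and `pair_domains` are hypothetical helpers.

```python
def split_csv(value, cast=str):
    """Split a comma-separated argument string and cast each item."""
    return [cast(item.strip()) for item in value.split(",") if item.strip()]


def pair_domains(forget_corpora, retain_corpora, alpha):
    """Pair each forget corpus with a retain corpus and a loss weight.

    Assumption: the three lists are aligned by position.
    """
    forget = split_csv(forget_corpora)
    retain = split_csv(retain_corpora)
    weights = split_csv(alpha, float)
    if not (len(forget) == len(retain) == len(weights)):
        raise ValueError(
            "forget_corpora, retain_corpora and alpha must have the same "
            f"number of entries, got {len(forget)}, {len(retain)}, {len(weights)}"
        )
    return list(zip(forget, retain, weights))


# Default values taken from the argument list above.
domains = pair_domains(
    "bio-forget-corpus,cyber-forget-corpus",
    "wikitext,wikitext",
    "1200,1200",
)
layer_ids = split_csv("5,6,7", int)

for forget, retain, weight in domains:
    print(f"forget={forget} retain={retain} alpha={weight}")
print("layers:", layer_ids)
```

Checking the list lengths up front surfaces a mismatched `--alpha` before any model is loaded.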

## eraser/utils.py

The `utils.py` file contains various utility functions and classes required to run `eraser/eraser.py`. It implements core components of the ERASER unlearning method, including model manipulation, data preparation, and loss computation.

Key functionalities include:

- **LoRA Adapter Integration**  
  Utilities to apply LoRA adapters to selected transformer layers.

- **Subspace-Based Target Vector Construction**  
  Functions for computing and removing shared dominant directions across forget domains:  
  - `compute_shared_dominant_vectors`  
  - `remove_subspace`

- **Disentanglement Loss Computation**  
  A function to compute the loss for separating retain and forget representations:  
  - `compute_preference_loss`

- **DisentangleHead Module**  
  A lightweight neural module (two-layer MLP) used to project representations for contrastive disentanglement:  
  - `PreferenceHead`

- **Model and Dataset Utilities**  
  Functions to load pretrained models and prepare datasets:  
  - `load_model`  
  - `get_data`

These components are modular and designed to support experimentation with different unlearning strategies and architectures.
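To make the subspace step more concrete, the following sketch estimates dominant directions of a set of activations by power iteration (the default `--method_for_pcs`) and projects them out. It is not the repository's `compute_shared_dominant_vectors` or `remove_subspace`: the function names `top_directions` and `project_out`, their signatures, and the synthetic data are assumptions, and the sketch handles a single activation matrix without modeling how directions are shared across forget domains.

```python
import numpy as np


def top_directions(acts, num_pcs, num_iters=100, seed=42):
    """Estimate the top principal directions by power iteration with deflation.

    acts: array of shape (num_samples, hidden_dim).
    Returns an array of shape (num_pcs, hidden_dim) with orthonormal rows.
    Assumes num_pcs does not exceed the rank of the centered data.
    """
    rng = np.random.default_rng(seed)
    centered = acts - acts.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(acts) - 1, 1)
    directions = []
    for _ in range(num_pcs):
        v = rng.standard_normal(cov.shape[0])
        for _ in range(num_iters):
            v = cov @ v
            # Deflation: remove components along directions already found.
            for u in directions:
                v -= (v @ u) * u
            v /= np.linalg.norm(v)
        directions.append(v)
    return np.stack(directions)


def project_out(hidden, directions):
    """Remove the component of `hidden` that lies in the span of `directions`."""
    return hidden - (hidden @ directions.T) @ directions


# Synthetic activations with one strongly dominant direction (axis 0).
rng = np.random.default_rng(0)
dominant = np.zeros(16)
dominant[0] = 1.0
acts = rng.standard_normal((100, 16)) + 10.0 * rng.standard_normal((100, 1)) * dominant

pcs = top_directions(acts, num_pcs=2)
cleaned = project_out(acts, pcs)

# Residual along the removed directions; expected to be numerically near zero.
print(np.abs(cleaned @ pcs.T).max())
```

In this sketch `num_pcs` plays the role of `--num_PCs` and the number of rows in `acts` the role of `--num_samples_for_pcs`.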


## How to Run
### 1. Select GPU
```python
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```

### 2. Launch Unlearning (ERASER)
```bash
python3 -m eraser.eraser --verbose
```

### 3. Run Evaluation
```bash
lm-eval --model hf \
    --model_args pretrained=models/unlearned_model \
    --tasks mmlu,wmdp \
    --batch_size 32
```

## data/

The `data/` directory contains all processed datasets used in unlearning and adversarial evaluation experiments.

### Contents

- **`wmdp_50_sampleds.csv`**  
  A modified subset of the WMDP benchmark (50 samples) used for jailbreak experiments.  
  This is a curated version derived from:  
  [https://github.com/centerforaisafety/wmdp](https://github.com/centerforaisafety/wmdp)

- **`processed_fictional_knowledge.json`**  
  Fictional factual knowledge used to perform continual learning on the model.  
  These facts are guaranteed to be **unseen** in the original training data.

- **`fictional_knowledge.json`**  
  Evaluation set for measuring how well the fictional knowledge has been forgotten.  
  Follows the evaluation protocol from:  
  [https://github.com/kaistAI/factual-knowledge-acquisition](https://github.com/kaistAI/factual-knowledge-acquisition)

### Notes

- The main training and evaluation datasets for WMDP unlearning experiments are available at:  
  [https://github.com/centerforaisafety/wmdp/tree/main](https://github.com/centerforaisafety/wmdp/tree/main)

- Fictional knowledge construction and evaluation methodology is based on:  
  [https://github.com/kaistAI/factual-knowledge-acquisition](https://github.com/kaistAI/factual-knowledge-acquisition)

## jailbreak/

The `jailbreak/` directory contains implementations of adversarial attack pipelines used to evaluate whether the unlearning method (ERASER) successfully removes targeted knowledge from LLMs.

### Structure

- **`gcg/`**  
  Implements the **Greedy Coordinate Gradient (GCG)** attack based on forgotten knowledge.  
  This folder includes:
  - GCG attack generation scripts.
  - Attack Success Rate (ASR) evaluation.  
  The implementation is adapted from:  
  [https://github.com/GraySwanAI/nanoGCG](https://github.com/GraySwanAI/nanoGCG)

- **`embedding_attack/`**  
  Contains code for **embedding space attacks** that manipulate hidden representations to elicit forgotten responses.  
  The implementation is based on:  
  [https://github.com/SchwinnL/LLM_Embedding_Attack](https://github.com/SchwinnL/LLM_Embedding_Attack)

These adversarial tests serve as a rigorous evaluation of the model’s ability to "forget" harmful or sensitive knowledge after applying ERASER.
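For reference, Attack Success Rate is the fraction of attack attempts that succeed. The sketch below computes it from a list of per-prompt outcomes. How an individual attack is judged successful is decided by the evaluation scripts and is not modeled here; the function name and the example outcomes are hypothetical.

```python
def attack_success_rate(outcomes):
    """Return the fraction of attack attempts judged successful."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no attack outcomes given")
    return sum(bool(o) for o in outcomes) / len(outcomes)


# Hypothetical outcomes for five attacked prompts
# (True means the forgotten answer was elicited).
outcomes = [False, False, True, False, False]
asr = attack_success_rate(outcomes)
print(f"ASR = {asr:.2f}")  # prints "ASR = 0.20"
```

A lower ASR after unlearning indicates that the attacks recover less of the targeted knowledge.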

## factual_knowledge/

The `factual_knowledge/` directory contains experiments designed to evaluate how effectively ERASER forgets **fictional knowledge** introduced through continual learning.

### Purpose

This setting tests whether the model can truly forget information that it did not originally possess. Specifically, we inject **fictional but syntactically valid factual knowledge** into the model via continual learning, and then assess the extent to which ERASER removes this injected knowledge.

### Contents

- Scripts for **continual learning** using fictional facts.
- Evaluation code based on three metrics:
  - `MEM`
  - `GEN`
  - `HARD-GEN`

The dataset and evaluation protocol follow the design proposed in:  
[https://github.com/kaistAI/factual-knowledge-acquisition](https://github.com/kaistAI/factual-knowledge-acquisition)

This benchmark provides a controlled and interpretable way to measure unlearning effectiveness on explicitly injected knowledge.

## Safeguards

This project includes a list of **ethical safeguards and considerations** related to the development and evaluation of unlearning methods.  
You can find these details in the `safeguards.txt` file included in the root directory of this repository.
