We use the repository[^1] released with the paper "SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning"[^2], as it provides efficient implementations of Gradient Ascent, Negative Preference Optimization, and KL divergence as unlearning loss functions.

<h4 style="text-align: center;">Unlearning Algorithms</h4>

We first give an overview of each of these unlearning algorithms and then provide their implementations, which can be plugged into the files of that repository.

These unlearning algorithms mainly consider two sets of documents: (i) the *forget set* and (ii) the *retain set*. The forget set contains the documents to be unlearned, while the retain set contains documents that help the model preserve its utility. Let $D_\text{f}$ and $D_\text{r}$ denote the forget and retain sets, respectively.

Unlearning algorithms aim to forget the documents in $D_\text{f}$ by maximizing one loss and to retain utility on the documents in $D_\text{r}$ by minimizing another. More formally, existing unlearning algorithms solve the following optimization problem:

<!-- ![unlearning-algorithm](math/unlearning_general.png) -->
<div align="center">
    <img src="math/unlearning_general.png" alt="unlearning general" />
</div>

where $\mathcal{L}_\text{f}$ and $\mathcal{L}_\text{r}$ are the loss functions over the documents in the forget and retain sets, respectively, and $\lambda \geq 0$ is a regularization parameter that balances unlearning against utility preservation.
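
As a rough numerical illustration of how $\lambda$ trades off the two terms, the sketch below combines a forget loss and a retain loss into a single scalar to minimize. The sign convention (forget term negated, as in the `GA_FT` trainer in the implementation section) and the name `combined_objective` are assumptions made for illustration; the exact objective is the one shown in the figure above.

```python
def combined_objective(forget_loss: float, retain_loss: float, lam: float) -> float:
    """Scalar to minimize: push the forget loss up, keep the retain loss down.

    Assumes the forget term enters with a negative sign (it is maximized)
    and the retain term is weighted by lam >= 0.
    """
    if lam < 0:
        raise ValueError("lam must be non-negative")
    return -forget_loss + lam * retain_loss


# With lam = 0 only forgetting matters; a larger lam weights utility more heavily.
for lam in (0.0, 1.0, 5.0):
    print(lam, combined_objective(forget_loss=2.0, retain_loss=0.5, lam=lam))
```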

Given an input prompt $x$, let $P_\theta(x)$ be the next-token probability distribution over the vocabulary under the unlearned model, and let $P_\text{c}(x)$ be the corresponding distribution under the corrupted model. Additionally, we write $P_\theta(y|x)$ and $P_\text{c}(y|x)$ for the probability of sampling token $y$ given prompt $x$.

*Gradient Ascent.* &nbsp; This method uses the *negative* training loss: GA aims to maximize the next-token-prediction loss over the tokens in the forget set. The loss function for a sample $\mathbf{x} \sim D_\text{f}$, consisting of $T$ tokens, can be expressed as
<!-- $$\mathcal{L}_{\text{GA}}(\mathbf{x}, \theta)=\frac{1}{T} \sum_i \log \left(P_\theta\left(\mathbf{x}_i \mid \mathbf{x}_{\lt i}\right)\right).$$ -->
<div align="center">
    <img src="math/GA.png" alt="GA" />
</div>
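
To make the quantity concrete, here is a minimal pure-Python sketch that evaluates a GA-style loss for one sample, assuming the loss is the mean log-probability the model assigns to the observed tokens. The function name `ga_loss` and the toy probabilities are hypothetical.

```python
import math


def ga_loss(token_probs):
    """Mean log-probability of the observed tokens: (1/T) * sum_i log p_i.

    token_probs[i] stands for P_theta(x_i | x_<i) of the i-th token.
    """
    if not token_probs:
        raise ValueError("token_probs must contain at least one token")
    return sum(math.log(p) for p in token_probs) / len(token_probs)


# A model that still predicts the forget sample well has a value near zero;
# driving this value down lowers the probability of the forget tokens.
confident = ga_loss([0.9, 0.8, 0.95])
unsure = ga_loss([0.1, 0.2, 0.05])
print(confident, unsure)
```

Minimizing this value is the same as maximizing the usual next-token cross-entropy on the sample.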

*KL Divergence.* &nbsp; This method uses the Kullback–Leibler divergence: it seeks a model that maximizes the KL divergence, on $D_\text{f}$, between the predictions of the corrupted model and those of the unlearned model (as it undergoes unlearning). The loss function for a sample $\mathbf{x} \sim D_\text{f}$ consisting of $T$ tokens can be written as
<!-- ![KL](math/KL.png) -->
<div align="center">
    <img src="math/KL.png" alt="KL" />
</div>
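
For readers who want the definition in executable form, the sketch below computes the KL divergence between two next-token distributions and averages it over the positions of a sample. The direction $\mathrm{KL}(P_\text{c} \,\|\, P_\theta)$, the helper names, and the toy distributions are assumptions for illustration only.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_v p(v) * log(p(v) / q(v)) over a shared vocabulary."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same vocabulary size")
    return sum(pv * math.log((pv + eps) / (qv + eps)) for pv, qv in zip(p, q) if pv > 0)


def mean_kl(ref_dists, model_dists):
    """Average KL(ref || model) over the T token positions of one sample."""
    if len(ref_dists) != len(model_dists) or not ref_dists:
        raise ValueError("need the same, non-zero number of positions")
    return sum(kl_divergence(r, m) for r, m in zip(ref_dists, model_dists)) / len(ref_dists)


# Two positions, vocabulary of size 2.
reference = [[0.5, 0.5], [0.9, 0.1]]
same = mean_kl(reference, reference)                    # identical predictions
moved = mean_kl(reference, [[0.9, 0.1], [0.5, 0.5]])    # predictions have drifted
print(same, moved)
```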

*Negative Preference Optimization.* &nbsp; This method casts the unlearning problem into the preference optimization framework by treating each pair (${x_{<i}}, {x_i}$), where ${x} \in D_\text{f}$, as providing only a negative response when ${x}_{<i}$ is given to the model as a prompt. More formally, the loss function is
<!-- $$
    \mathcal{L}_{\text{NPO}}(\mathbf{x}, \theta)
    =
    \frac{2}{\beta T} \sum_i \log 
    \left( 1 + \left( \frac{P_\theta(\mathbf{x}_i | \mathbf{x}_{<i})}{P_\text{c}(\mathbf{x}_i | \mathbf{x}_{<i})}\right) ^ \beta \right)
$$ -->
<!-- ![NPO](math/NPO.png) -->
<div align="center">
    <img src="math/NPO.png" alt="NPO" />
</div>

where $\beta > 0$ is the inverse temperature.
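
The following pure-Python sketch evaluates an NPO-style loss for one sample, assuming the form $\frac{2}{\beta T}\sum_i \log\big(1 + (P_\theta(x_i \mid x_{<i}) / P_\text{c}(x_i \mid x_{<i}))^\beta\big)$. The function name `npo_loss`, the default $\beta = 0.1$ (the value hard-coded in the trainer in the implementation section), and the toy probabilities are illustrative assumptions.

```python
import math


def npo_loss(model_probs, ref_probs, beta=0.1):
    """(2 / (beta * T)) * sum_i log(1 + (p_theta_i / p_c_i) ** beta)."""
    if len(model_probs) != len(ref_probs) or not model_probs:
        raise ValueError("need the same, non-zero number of token probabilities")
    if beta <= 0:
        raise ValueError("beta must be positive")
    total = sum(math.log1p((p / r) ** beta) for p, r in zip(model_probs, ref_probs))
    return 2.0 * total / (beta * len(model_probs))


ref = [0.3, 0.3]
unchanged = npo_loss(ref, ref)            # model equals reference: (2 / beta) * log 2
forgotten = npo_loss([0.01, 0.01], ref)   # model assigns far less mass to the forget tokens
print(unchanged, forgotten)
```

The loss decreases as the model's probability on the forget tokens drops below the reference model's, and it is bounded below by zero.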

*Task Vector.* &nbsp; This method derives a parameter-space vector aligned with the influence of the forget-set documents, and then updates the corrupted model's parameters by moving in the opposite direction of that vector. More formally, let $\theta_c$ be the corrupted model's parameters. The task vector method continues fine-tuning the corrupted model on $D_\text{f}$ and obtains the optimal parameters $\theta_*$.
Then the unlearned model's parameters are obtained as

<div align="center">
    <img src="math/Task-Vector.png" alt="Task Vector" />
</div>

where $\alpha > 0$ controls the step size.
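
A minimal sketch of this update on toy parameters follows. It assumes the task vector is $\theta_* - \theta_c$ and that the unlearned parameters are $\theta_c - \alpha(\theta_* - \theta_c)$; `apply_task_vector` is a hypothetical helper, and plain lists of floats stand in for model tensors.

```python
def apply_task_vector(theta_c, theta_star, alpha):
    """Return theta_c - alpha * (theta_star - theta_c) for every named parameter.

    theta_c and theta_star map parameter names to equally sized lists of floats.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if theta_c.keys() != theta_star.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [c - alpha * (s - c) for c, s in zip(theta_c[name], theta_star[name])]
        for name in theta_c
    }


theta_c = {"w": [1.0, 2.0]}       # corrupted model
theta_star = {"w": [1.5, 1.0]}    # after further fine-tuning on the forget set
unlearned = apply_task_vector(theta_c, theta_star, alpha=1.0)
print(unlearned)
```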


<h4 style="text-align: center;">Implementation Details</h4>
<br>

**Gradient Ascent**
```python
class GA_FT(GA):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

    def compute_loss(self, model, inputs, return_outputs=False):
        forget_data = inputs["forget"]

        forget_inputs = {
            "input_ids": forget_data[0],
            "attention_mask": forget_data[1],
            "labels": forget_data[2],
        }

        retain_data = inputs["retain"]

        retain_inputs = {
            "input_ids": retain_data[0],
            "attention_mask": retain_data[1],
            "labels": retain_data[2],
        }

        forget_outputs = model(**forget_inputs)
        retain_outputs = model(**retain_inputs)

        # Ascend on the forget loss (negated) while descending on the gamma-weighted retain loss.
        loss = -forget_outputs.loss + self.gamma * retain_outputs.loss
        return (loss, forget_outputs) if return_outputs else loss

```

<br>

**KL Divergence**
```python
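import torch


def kl_loss(prob_p, prob_q):
    # Cross-entropy term -sum_v p(v) * log q(v), averaged over batch and positions;
    # the 1e-12 offset guards against log(0).
    return -(prob_p * torch.log(prob_q + 1e-12)).sum(-1).mean()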

class KL_FT(BaseTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print(f'KL+FT is selected as unlearning method: [gamma: {self.gamma}]')

    def compute_loss(self, model, inputs, return_outputs=False):
        forget_data = inputs["forget"]

        forget_inputs = {
            "input_ids": forget_data[0],
            "attention_mask": forget_data[1],
            "labels": forget_data[2],
        }

        retain_data = inputs["retain"]

        retain_inputs = {
            "input_ids": retain_data[0],
            "attention_mask": retain_data[1],
            "labels": retain_data[2],
        }

        forget_outputs = model(**forget_inputs)
        retain_outputs = model(**retain_inputs)

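        # Predictions of the reference model held in self.infer_model; no gradients needed.
        with torch.no_grad():
            infer_forget_outputs = self.infer_model(**forget_inputs)

        # Next-token distributions of the model being unlearned (p) and of the reference model (q).
        prob_forget_p = torch.softmax(forget_outputs.logits, dim=-1)
        prob_forget_q = torch.softmax(infer_forget_outputs.logits, dim=-1)

        forget_loss = kl_loss(prob_forget_p, prob_forget_q)

        # The forget term is negated, so minimizing `loss` maximizes it.
        loss = -self.gamma * forget_loss + retain_outputs.loss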

        return (loss, forget_outputs) if return_outputs else loss
```
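
Note that `kl_loss` as written evaluates the cross-entropy term $-\sum_v p(v)\log q(v)$, which equals the entropy of $p$ plus $\mathrm{KL}(p \,\|\, q)$. The pure-Python check below illustrates that identity on a toy distribution; the helper names are hypothetical and are not part of the repository.

```python
import math


def cross_entropy(p, q):
    """-sum_v p(v) * log q(v), the quantity computed per position by kl_loss."""
    return -sum(pv * math.log(qv) for pv, qv in zip(p, q))


def entropy(p):
    """-sum_v p(v) * log p(v)."""
    return -sum(pv * math.log(pv) for pv in p if pv > 0)


def kl(p, q):
    """sum_v p(v) * log(p(v) / q(v))."""
    return sum(pv * math.log(pv / qv) for pv, qv in zip(p, q) if pv > 0)


p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
gap = cross_entropy(p, q) - (entropy(p) + kl(p, q))
print(abs(gap) < 1e-12)
```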

<br>

**NPO**
```python
import torch


class NPO(BaseTrainer):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        print(f'NPO is selected as unlearning method: [gamma: {self.gamma}]')

    def compute_loss(self, model, inputs, return_outputs=False):
        forget_data = inputs["forget"]

        forget_inputs = {
            "input_ids": forget_data[0],
            "attention_mask": forget_data[1],
            "labels": forget_data[2],
        }

        retain_data = inputs["retain"]
        retain_inputs = {
            "input_ids": retain_data[0],
            "attention_mask": retain_data[1],
            "labels": retain_data[2],
        }

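        # Mean next-token loss of the model being unlearned on the forget batch.
        outputs = model(**forget_inputs)
        current_forget_loss = outputs.loss

        # Same loss under the reference model held in self.infer_model; no gradients needed.
        with torch.no_grad():
            ref_outputs = self.infer_model(**forget_inputs)
            ref_forget_loss = ref_outputs.loss

        # Difference of mean negative log-likelihoods, i.e. the negative mean log-ratio.
        neg_log_ratios = current_forget_loss - ref_forget_loss

        retain_outputs = model(**retain_inputs)
        retain_loss = retain_outputs.loss

        # NPO forget loss with inverse temperature beta (fixed to 0.1 here):
        # -(2 / beta) * log sigmoid(beta * neg_log_ratios).
        beta = 0.1
        forget_loss = -torch.nn.functional.logsigmoid(beta * neg_log_ratios).mean() * 2 / beta

        loss = forget_loss + self.gamma * retain_loss

        return (loss, outputs) if return_outputs else loss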
```

The code above must be placed in the `src/unlearn/` directory. The `src/unlearn/` directory used in our experiments can be found [here](unlearn/).
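
The three `compute_loss` methods repeat the same unpacking of `(input_ids, attention_mask, labels)` triples. As an optional refactoring sketch, a small helper could remove that duplication; `to_model_inputs` is a hypothetical name, not something the repository provides, and the toy lists below stand in for tensors.

```python
def to_model_inputs(batch):
    """Map an (input_ids, attention_mask, labels) triple to model keyword arguments."""
    input_ids, attention_mask, labels = batch
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# Toy batch with the layout the trainers expect: one triple per split.
inputs = {
    "forget": ([[1, 2, 3]], [[1, 1, 1]], [[1, 2, 3]]),
    "retain": ([[4, 5, 0]], [[1, 1, 0]], [[4, 5, -100]]),
}
forget_inputs = to_model_inputs(inputs["forget"])
retain_inputs = to_model_inputs(inputs["retain"])
print(sorted(forget_inputs))
```

Inside a trainer, each dictionary would then be passed on as `model(**forget_inputs)`, exactly as in the code above.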


[^1]: <small>https://github.com/OPTML-Group/SOUL</small>
[^2]: <small>https://arxiv.org/pdf/2404.18239</small>
