<div align="center">

**Detoxifying Large Language Models via Knowledge Editing**

![](https://img.shields.io/badge/version-v0.0.1-blue)
[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
![](https://img.shields.io/badge/PRs-Welcome-red)

---

<p align="center">
  <a href="#-detoxifying-via-knowledge-editing">Overview</a> •
  <a href="#-dinm">DINM</a> •
  <a href="#-how-to-run">How to Run</a> •
  <a href="#-track-2-of-task-10-for-nlpcc-2024">NLPCC 2024</a> •
  <a href="#-citation">Citation</a> •
  <a href="https://arxiv.org/abs/2403.14472">Paper</a> •
  <a href="https://zjunlp.github.io/project/SafeEdit">Website</a> 

  <!-- <a href="#💡-detoxifying-via-knowledge-editing">Overview</a> •
  <a href="#🚀-dinm">DINM</a> •
  <a href="#🌟-how-to-run">How to Run</a> •
  <a href="#📚-track-2-of-task-10-for-nlpcc-2024">NLPCC 2024</a> •
  <a href="#📖-citation">Citation</a> •
  <a href="https://arxiv.org/abs/2403.14472">Paper</a> •
  <a href="https://zjunlp.github.io/project/SafeEdit">Website</a>  -->
</p>

</div>

<div align="center">
This paper has been accepted by ACL 2024.
</div>

# 💡 Detoxifying via Knowledge Editing

<div align=center>
<img src="../figs/safety_task.gif" width="70%" height="70%" />
</div>

## Task Definition

**Detoxifying LLM** strives to build a safe and trustworthy large  language model (LLM).
Knowledge editing focuses on specific areas for permanent adjustment without compromising overall performance. 
Then, detoxifying LLM via knowledge editing leverages a small amount of data, usually an instance, to correct the toxic behaviors of the LLM. The edited LLM can defend against various malicious inputs.





## Evaluation

We extend evaluation metrics to Defense Duccess (DS), Defense Generalization (DG), and General Performance.
- Defense Duccess (DS): the detoxification success rate of edited LLM for an adversarial input (attack prompt + harmful question), which is used to modify LLM.

- Defense Generalization (DG): the detoxification success rate of edited LLM for out-of-domain (OOD) malicious inputs.
  - `DG of only harmful question`($\mathrm{DG}_\text{onlyQ}$): the detoxification success rate for only harmful question.
  - `DG of other attack prompts`($\mathrm{DG}_\text{otherA}$): the detoxification success rate for unseen attack prompts.
  - `DG of other harmful questions`($\mathrm{DG}_\text{otherQ}$): the detoxification success rate for unseen harmful questions.
  - `DG of other attack prompts and questions`($\mathrm{DG}_\text{otherAQ}$): the detoxification success rate for unseen attack prompts and harmful questions.

- General Performance: the side effects for unrelated task.

  - `Fluency`: n-gram of responses generated by edited LLM for malicious inputs.
  - `KQA`: the success rate of knowledge question answering on [TriviaQA](https://arxiv.org/pdf/1705.03551.pdf).
  - `CSM`: ROUGE-1 of content summarization ability on [Xsum](https://arxiv.org/pdf/1808.08745.pdf).


We evaluate DS and DG by [SafeEdit-Safety-Classifier](https://huggingface.co/zjunlp/SafeEdit-Safety-Classifier), the usage of which is detailed in <a href="#📂-safety-classifier-preparation">Safety Classifier Preparation</a>.
The statistics of Fluency can be found in our EasyEdit.
We evaluate KQA and CSM by [OpenCompass](https://github.com/open-compass/opencompass).


## 🚀 DINM

Inspired by intraoperative neurophysiological monitoring, we design a simple yet effective knowledge editing baseline called Detoxifying with Intraoperative Neural Monitoring (DINM). 
DINM uses an instance to locate and edit toxic regions of the LLM.



<div align=center>
<img src="../figs/DINM.png" width="70%" height="70%" />
</div>

# 🌟 How to Run

<!-- ## 🎍 Current Implementation
As the main Table of our paper, two editing methods are supported for detoxifying LLM by EasyEdit.
| **Method** | LlaMA2-7B-Chat | Mistral-7B-v0.1
| :--------------: | :--------------: | :--------------: | 
| MEND | ✅ | ✅ |
| DINM | ✅ | ✅ |  -->




## 🔧 Pip Installation


To get started, simply install conda and run:

```shell
git clone https://github.com/zjunlp/EasyEdit.git
conda create -n EasyEdit python=3.9.7
...
conda activate EasyEdit
pip install -r requirements.txt
```

> ❗️❗️ If you intend to use Mistral, please update the `transformers` library to version 4.34.0 manually. You can use the following code: `pip install transformers==4.34.0`.

---


## 📂 Data Preparation

**Dataset for Detoxifying LLM via Knowledge Editing: SafeEdit** 
You can download it from [[Hugging Face]](https://huggingface.co/datasets/zjunlp/SafeEdit), then put the data in folder "./data".
**"SafeEdit_test.json"** is the test data file containing 1350 instances.


## 📂 Safety Classifier Preparation

SafeEdit-Safety-Classifier, we used for judgment, is hosted on [Hugging Face](https://huggingface.co/zjunlp/SafeEdit-Safety-Classifier). 
You can use the Safety Classifier:
```shell
from transformers import RobertaForSequenceClassification, RobertaTokenizer
safety_classifier_dir = 'zjunlp/SafeEdit-Safety-Classifier'
safety_classifier_model = RobertaForSequenceClassification.from_pretrained(safety_classifier_dir)
safety_classifier_tokenizer = RobertaTokenizer.from_pretrained(safety_classifier_dir)
```
You can also download [SafeEdit-Safety-Classifier](https://huggingface.co/zjunlp/SafeEdit-Safety-Classifier), and put the judgment model to your own path.
When running the [run_safety_editing.py](https://github.com/zjunlp/EasyEdit/blob/main/examples/run_safety_editing.py) file, you only need to provide safety_classifier_dir to use this classifier.


## 💻 Run

Before you begin running the program, ensure that the necessary files are present and properly set up, specifically the directories **./data, ./hparams,**. 



Our method supports multi-GPU editing. You can try setting the `model_parallel` to `true` in the configuration file `../hparams/DINM/mistral-7b` to enable multi-GPU editing.
```shell
python run_safety_editing.py --editing_method=DINM --edited_model=mistral-7b --hparams_dir=../hparams/DINM/mistral-7b --safety_classifier_dir=zjunlp/SafeEdit-Safety-Classifier --metrics_save_dir=../safety_results
```

> ❗️❗️ You can download SafeEdit-Safety-Classifier manually to your own path, and set safety_classifier_dir to your local path.
Then, you can obtain the evaluation for DS, DG, and Fluency in the path ../safety_results.
For KQA and CSM evaluations, please use [OpenCompass](https://github.com/open-compass/opencompass).


> ❗️❗️ A friendly reminder: if you use SafeEdit dataset for evaluation, it is recommended to set max_output_length to 600 in mistral-7b.yaml (if you don't use mistral-7b.yaml, please replace your own .yaml file). For some role-playing attack prompts, LLMs may initially generate safe responses and then suddenly generate toxic text.
Considering the maximum length of certain LLMs may not suffice; you can truncate the input length (from right to left, as harmful questions typically appear on the right).

# 🎍 Demo

Here is the demo introduction of detoxifying Mistral-7B-v0.1 on one A800 GPU by DINM. 
You can download the [demo video](https://github.com/zjunlp/EasyEdit/blob/main/figs/SafeEdit_demo.mp4) and use [SafeEdit_demo](https://github.com/zjunlp/EasyEdit/tree/main/demo/SafeEdit_demo) to get started quickly.

- Click the button **Edit**: DINM use an instace to locate and edit toxic regions of Mistral-7B-v0.1. Then, we can obtain the toxic layer of Mistral-7B-v0.1, and edited Mistral-7B-v0.1.
- Click the button **Generate** of Defense Success: Edited Mistral-7B-v0.1 generates response for adversarial input, which is used for Defense Success metric.
- Click the button **Generate** of Defense Generalization: Edited Mistral-7B-v0.1 generates response for out-of-domain malicous input, which is used for Defense Generalization metric.

<div align=center>
<img src="../figs/SafeEdit_demo_gif.gif" width="70%" height="70%" />
</div>

# 📚 Track 2 of Task 10 for NLPCC 2024

## Training 
Please refer to [this link](https://github.com/eric-mitchell/direct-preference-optimization) for the code of the SFT and DPO.

For DINM method, you should first complete the <a href="##📂 Data Preparation ">Data Preparation</a>.

Second, move the file **[train_DINM_for_NLPCC.py](https://github.com/zjunlp/EasyEdit/blob/main/examples/train_DINM_for_NLPCC.py)** to **./** (We will later modify the code to adapt to running in the current directory), and run:
```shell
python train_DINM_for_NLPCC.py --hparams_dir ./hparams/DINM/llama-7b --results_save_dir ./safety_results
```



## Testing
To evaluate the detoxifying performance of edited model, you should move the file **[test_detoxify_generate_for_NLPCC.py](https://github.com/zjunlp/EasyEdit/blob/main/examples/test_detoxify_generate_for_NLPCC.py)** to **./** (We will later modify the code to adapt to running in the current directory), and run:
```shell
python test_detoxify_generate.py --edited_LLM_ckpt ./safety_results/dinm_llama2-chat --tok_ckpt ./hugging_cache/llama-2-7b --results_save_dir ./safety_results
```
> ❗️❗️ Please set max_output_length to 600 in llama-7b.yaml. For some role-playing attack prompts, LLMs may initially generate safe responses and then suddenly generate toxic text. Therefore, you should set enough max_output_length to evaluate the safety of LLM.

# 📖 Citation

Please cite our paper if you use **SafeEdit**, **SafeEdit-Safety-Classifier** and **DINM** in your work.





```bibtex
@misc{wang2024SafeEdit,
      title={Detoxifying Large Language Models via Knowledge Editing}, 
      author={Mengru Wang, Ningyu Zhang, Ziwen Xu, Zekun Xi, Shumin Deng, Yunzhi Yao, Qishen Zhang, Linyi Yang, Jindong Wang, Huajun Chen},
      year={2024},
      eprint={2403.14472},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

# 🎉 Acknowledgement


We are deeply grateful to [Yue Zhang](https://scholar.google.co.uk/citations?user=6hA7WmUAAAAJ&hl=en) from Westlake University and [Xing Xie](https://www.microsoft.com/en-us/research/people/xingx/representative-publications/) from Microsoft Research Asia for their insightful feedback and constructive suggestions, which greatly enhanced the quality of this paper. 
We would like to express our heartfelt gratitude for Minlie Huang and team members from Tsinghua University for the contributions of [Safety Benchmark](https://arxiv.org/pdf/2309.07045.pdf) and [Assessmen](https://doi.org/10.48550/arXiv.2304.10436), Tatsunori B. Hashimoto and his team for the contributions of [instructions following data](https://github.com/tatsu-lab/alpaca_eval), [Jiahao Yu](https://doi.org/10.48550/arXiv.2309.10253), [Yang Li](https://doi.org/10.48550/arXiv.2305.13860), [Shujian Huang](https://doi.org/10.48550/arXiv.2311.08268), [Danqi Chen](https://doi.org/10.48550/arXiv.2310.06987), and [Jacob Steinhardtfor](https://doi.org/10.48550/arXiv.2307.02483) their contributions of security attack technique. 
We utilize portions of their attack prompts and unsafe category in this paper and express sincere gratitude.
We also extend our thanks to Andrew Lee. 
Inspired by [Andrew Lee's research](https://doi.org/10.48550/arXiv.2401.01967) , we delve into a preliminary mechanistic analysis of SFT, DPO, and our DINM.
Besides, we extend special thanks to Zhexin Zhang form Tsinghua university for providing valuable insights on conducting fair comparisons between traditional and knowledge editing methods in our experiments.
