InstructSafety: A Unified Framework for Building Multidimensional and Explainable Safety Detector through Instruction Tuning

Zhexin Zhang; Jiale Cheng; Hao Sun; Jiawen Deng; Minlie Huang

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Ethics in NLP

Keywords: safety detection, unified framework, instruction tuning

TL;DR: A unified framework for building a multidimensional and explainable safety detector through instruction tuning.

Abstract: Safety detection has become increasingly important in recent years, and the rapid development of large language models makes reliable safety detection systems even more necessary. However, currently available safety detection systems are limited in versatility and interpretability. In this paper, we first introduce InstructSafety, a safety detection framework that unifies 7 common sub-tasks for safety detection. These tasks are cast into a similar form through different instructions. We then conduct a comprehensive survey of existing safety detection datasets and process 39 human-annotated datasets for instruction tuning. We also construct adversarial samples to enhance the model's robustness. By fine-tuning Flan-T5 on the collected data, we obtain Safety-Flan-T5, a multidimensional and explainable safety detector. We conduct comprehensive experiments on a variety of datasets and tasks, and demonstrate the strong performance of Safety-Flan-T5 in comparison to supervised baselines and served APIs (Perspective API, ChatGPT and InstructGPT). We will release the processed data, the fine-tuned Safety-Flan-T5 model and related code for public use.
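The idea of unifying sub-tasks "through different instructions" can be illustrated with a minimal sketch. The abstract does not list the 7 sub-tasks or their instruction wording, so the task names, templates, and the `Example` and `to_seq2seq` helpers below are hypothetical placeholders, not the authors' actual format. The sketch only shows how examples from different tasks could be rendered as (input, target) text pairs for a sequence-to-sequence model such as Flan-T5.

```python
from dataclasses import dataclass

# Hypothetical instruction templates. The task names and wording are
# illustrative assumptions; the abstract does not specify them.
INSTRUCTIONS = {
    "binary_detection": "Is the following text safe or unsafe?",
    "category_classification": "Which safety category does the following text fall under?",
    "span_extraction": "Which span of the following text is unsafe?",
}


@dataclass
class Example:
    task: str    # key into INSTRUCTIONS
    text: str    # the text to be judged
    answer: str  # the human-annotated label, rendered as text


def to_seq2seq(example: Example) -> tuple[str, str]:
    """Render one labeled example as an (input, target) text pair."""
    instruction = INSTRUCTIONS[example.task]
    source = f"{instruction}\nText: {example.text}\nAnswer:"
    return source, example.answer


if __name__ == "__main__":
    source, target = to_seq2seq(Example("binary_detection", "Have a nice day.", "safe"))
    print(source)
    print(target)
```

Because every task shares the same input and output shape, one model can be fine-tuned on the mixture of all datasets, with the instruction alone selecting the behavior at inference time.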

Submission Number: 3522
