CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference

ACL ARR 2024 June Submission2889 Authors

15 Jun 2024 (modified: 02 Aug 2024), ACL ARR 2024 June Submission, CC BY 4.0

Abstract: As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research issue. Previous red-teaming approaches for LLM safety have primarily focused on single-prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicate that, under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities of LLMs during dialogue coreference interactions.

Paper Type: Short

Research Area: Linguistic theories, Cognitive Modeling and Psycholinguistics

Research Area Keywords: security/privacy; benchmarking; hate speech detection

Contribution Types: Data resources

Languages Studied: English

Submission Number: 2889
