## 1 Introduction

In recent years, conversational Large Language Models (LLMs)¹ have undergone rapid development (Touvron et al., 2023; Chiang et al., 2023; OpenAI, 2023a), showing powerful conversation capabilities in diverse applications (Bubeck et al., 2023; Chang et al., 2023). However, LLMs can also be exploited during conversation to facilitate harmful activities such as fraud and cyberattacks, presenting significant societal risks (Gupta et al., 2023; Mozes et al., 2023; Liu et al., 2023b). These risks include the propagation of toxic content (Gehman et al., 2020), the perpetuation of discriminatory biases (Hartvigsen et al., 2022), and the dissemination of misinformation (Lin et al., 2022).

The growing concerns regarding LLM conversation safety (specifically, ensuring that LLM responses are free from harmful information) have led to extensive research in attack and defense strategies (Zou et al., 2023; Mozes et al., 2023; Li et al., 2023d). This situation underscores the urgent need for a detailed review that summarizes recent advancements in LLM conversation safety, focusing on three main areas: 1) LLM attacks, 2) LLM defenses, and 3) the relevant evaluations of these strategies. While existing surveys have explored these fields individually to some extent, they either focus on the social impact of safety issues (McGuffie and Newhouse, 2020; Weidinger et al., 2021; Liu et al., 2023b) or cover only a specific subset of methods and lack a unifying overview that integrates the different aspects of conversation safety (Schwinn et al., 2023; Gupta et al., 2023; Mozes et al., 2023; Greshake et al., 2023).

¹ The LLMs we investigate in our study specifically refer to autoregressive conversational LLMs, which include two types: Pre-trained Large Language Models (PLLMs) like Llama-2 and GPT-3, and Fine-tuned Large Language Models (FLLMs) such as Llama-2-chat, ChatGPT, and GPT-4.

Therefore, in this survey, we aim to provide a comprehensive overview of recent studies on LLM conversation safety, covering LLM attacks, defenses, and evaluations (Figs. 1 and 2). Regarding attack methods (Sec. 2), we examine both inference-time approaches that attack LLMs through adversarial prompts and training-time approaches that involve explicit modifications to LLM weights. For defense methods (Sec. 3), we cover safety alignment, inference guidance, and filtering approaches. Furthermore, we provide an in-depth discussion of evaluation methods (Sec. 4), including safety datasets and metrics. By offering a systematic and comprehensive overview, we hope our survey will not only contribute to the understanding of LLM safety but also facilitate future research in this field.