Abstract: Human-machine dialogue is an important field of artificial intelligence and is regarded as the main form of the next generation of human-machine interaction. Because of its convenience, it is widely used in commercial scenarios, such as task-oriented dialogue systems and chatbots. Real-life conversations are often accompanied by emotional exchanges, but without the ability to perceive and express emotion, human-machine dialogue technology cannot support emotional communication in such scenarios. To make up for this lack of emotional intelligence, the deep learning based emotional dialogue response task has been proposed and has developed into an important research direction in the field of dialogue.
In this paper, we first review the development of the deep learning based emotional dialogue response task. According to their goals, we then divide the task into five subtasks: controllable emotional dialogue generation, empathetic dialogue response, emotion support, multimodal emotion dialogue generation, and new tasks. Controllable emotional dialogue generation focuses on generating responses with specified emotions. According to the components used to handle emotions, we divide these models into emotion perception models, emotion representation models, and emotion perception representation models. Empathetic dialogue response aims to perceive emotions automatically and to respond empathetically. According to the factors that influence empathetic perception and expression, we divide these models into emotion factor-based models, compound factor-based models, and structural factor-based models. Emotion support aims to regulate the speaker's emotions and thereby comfort the user. According to the type of emotion involved, we divide these models into three subtypes: stimulating positive emotions, stimulating specified emotions, and reducing negative emotions. Multimodal emotion dialogue generation models work with multiple modalities, such as images, audio, and text. Because the number of models in this category is relatively small, we do not classify them further. We also group the new tasks proposed in recent years into one category with three subtasks: multi-person empathic dialogue, value task, and language toxicity mitigation. Based on these categories, we compare the advantages and disadvantages of the models and outline the likely development trends of each subtask.
To explore deep learning based emotional dialogue response further, we also organize and analyze the models according to their commonly used structures. Because most models are built on sequence-to-sequence structures, we list and analyze the methods used to improve these structures. Given the rapid development of pre-trained models in recent years, some models have adopted them to improve performance, and we summarize the pre-training based models for this task. We also summarize the other structures involved in this task, including generative adversarial networks, reinforcement learning, dual learning, and template filling, and we compare and analyze the advantages and disadvantages of these structures.
We then introduce the commonly used datasets and evaluation metrics. Finally, we summarize the models and, based on this summary, discuss future development directions for the task.