Abstract: The social impact of natural language processing and its applications has received increasing attention. Here, we focus on the problem of safety for end-to-end conversational AI. We survey the problem landscape, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. To help researchers better understand the impact of their conversational models with respect to these scenarios, we present Safety Bench, a suite of open-source tools for quickly assessing safety issues. Finally, we provide extensive analysis of these tools using five popular models and make recommendations for future use.